[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170210T0000). [00:00:04] kaldari and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:10] here [00:03:30] I can SWAT [00:03:32] \o [00:04:26] (03PS2) 10Thcipriani: Set $wgPageAssessmentsSubprojects to true on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336871 (https://phabricator.wikimedia.org/T157654) (owner: 10Kaldari) [00:04:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336871 (https://phabricator.wikimedia.org/T157654) (owner: 10Kaldari) [00:06:39] (03Merged) 10jenkins-bot: Set $wgPageAssessmentsSubprojects to true on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336871 (https://phabricator.wikimedia.org/T157654) (owner: 10Kaldari) [00:06:47] (03CR) 10jenkins-bot: Set $wgPageAssessmentsSubprojects to true on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336871 (https://phabricator.wikimedia.org/T157654) (owner: 10Kaldari) [00:10:53] thcipriani: ready for me to test on 1002? [00:11:26] kaldari: one moment please. [00:12:13] kaldari: okie doke, now it's on mwdebug1002, sorry for the dleay [00:12:16] *delay [00:12:22] no worries! 
[00:14:03] thcipriani: Something seems to be going wrong: [00:14:03] https://logstash.wikimedia.org/app/kibana#/discover/DBQuery?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(_source),filters:!(),index:%27logstash-*%27,interval:auto,query:(query_string:(analyze_wildcard:!t,query:%27channel:DBQuery%20AND%20(level:ERROR%20OR%20level:WARNING)%27)),sort:!(%27@timestamp%27, [00:14:03] desc)) [00:14:08] not sure what though [00:15:25] nothing on Grafana though [00:15:58] looks like a lot of warnings are being generated [00:16:39] thcipriani: I guess we should roll it back for now. Not sure what's going on there. [00:16:51] I'll revert the change... [00:17:08] (03PS1) 10Kaldari: Revert "Set $wgPageAssessmentsSubprojects to true on English Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336942 [00:17:17] hrm, okie doke [00:17:20] thcipriani: https://gerrit.wikimedia.org/r/#/c/336942/ [00:18:32] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336942 (owner: 10Kaldari) [00:18:52] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.008 second response time [00:19:48] (03Merged) 10jenkins-bot: Revert "Set $wgPageAssessmentsSubprojects to true on English Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336942 (owner: 10Kaldari) [00:19:56] (03CR) 10jenkins-bot: Revert "Set $wgPageAssessmentsSubprojects to true on English Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336942 (owner: 10Kaldari) [00:20:31] kaldari: ok, revert now on mwdebug1002 [00:22:58] thcipriani: Well that didn't change anything. I guess it's just normal for us to have a few thousand db warnings an hour. 
I guess I should stick to just watching Grafana next time :P [00:23:44] thcipriani: sorry for panicking [00:23:49] it would have been surprising if mwdebug1002 were the sole cause of the flood :) [00:23:54] yeah [00:24:07] OK, let's try it again.... [00:24:12] with feeling this time... [00:24:15] cool :) [00:24:18] kaldari: kaldari o/ [00:24:23] lemme get out ebernhardson 's patch :) [00:24:26] sure [00:24:54] thcipriani: \o/ [00:25:18] ebernhardson: your change is live on mwdebug1002, check please [00:26:58] thcipriani: no glaring problems. should be fine. [00:27:35] (03PS1) 10Kaldari: Set $wgPageAssessmentsSubprojects to true on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336944 (https://phabricator.wikimedia.org/T157654) [00:27:40] ebernhardson: fine to do a sync-dir on it? [00:27:58] thcipriani: yea [00:30:18] (03PS7) 10EBernhardson: Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) [00:31:25] (03CR) 10EBernhardson: "The number of changes required has really expanded, i can split this patch up if desired. Reasonable split might be a pre-patch to remove " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [00:32:11] !log thcipriani@tin Synchronized php-1.29.0-wmf.11/extensions/WikimediaEvents: SWAT: [[gerrit:336896|Enable Sister project search AB test]] T149806 (duration: 00m 45s) [00:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:17] T149806: [A/B/C Test] Add cross-wiki search results in a right sidebar - https://phabricator.wikimedia.org/T149806 [00:32:17] ^ ebernhardson sync'd [00:33:03] kaldari: this the one? https://gerrit.wikimedia.org/r/#/c/336944/ [00:33:39] thcipriani: Yeah, thanks. 
It's identical to the old one [00:33:53] :) [00:34:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336944 (https://phabricator.wikimedia.org/T157654) (owner: 10Kaldari) [00:34:22] thcipriani: sweet, i'll watch our logging to make sure its all gravy. thanks [00:36:52] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.010 second response time [00:41:40] (03Merged) 10jenkins-bot: Set $wgPageAssessmentsSubprojects to true on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336944 (https://phabricator.wikimedia.org/T157654) (owner: 10Kaldari) [00:41:49] (03CR) 10jenkins-bot: Set $wgPageAssessmentsSubprojects to true on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336944 (https://phabricator.wikimedia.org/T157654) (owner: 10Kaldari) [00:42:40] kaldari: live on mwdebug1002, check please [00:42:47] checking... [00:44:42] thcipriani: looks kosher, but there is a bit of an error spike right now (likely not from this). OK to wait a couple minutes? [00:44:45] https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json [00:45:32] yeah, fine to wait a couple of minutes. [00:47:04] hrm. that spike lines up with the last sync I did. [00:47:17] but I don't see anything in logstash that explains it really. [00:47:33] ^ ebernhardson everything look normal to you post-change-sync? [00:48:52] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.010 second response time [00:49:27] hrm, well. Rate seems to be dropping back down to normal... [00:50:05] thcipriani: yup [00:50:31] thcipriani: i mean, its hard to say for 100%, we just enabled the test on 4 wikis either in the middle east or europe, so its the middle of the night. 
But nothing looks broken :) [00:50:58] yeah, error logs look normal for some value of normal [00:51:29] thcipriani: OK, I think you can sync my change whenever you're ready [00:51:33] must have just been a coincidence [00:51:38] kaldari: okie doke. [00:52:30] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team: Better mysql command prompt info - https://phabricator.wikimedia.org/T157714#3015480 (10Reedy) [00:53:12] !log transcode high-prio queue may be briefly blocked by an influx of low-res transcodes queued in bulk. should return to normal in a bit. [00:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:56] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:336944|Set $wgPageAssessmentsSubprojects to true on English Wikipedia]] T157654 (duration: 00m 43s) [00:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:00] T157654: Turn on subproject support for PageAssessments in production - https://phabricator.wikimedia.org/T157654 [00:54:08] ^ kaldari live everywhere [00:54:14] checking.. [00:55:11] thcipriani: looks good. Thanks for your patience :) [00:55:24] kaldari: absotively, glad everything's working :) [01:01:57] !log transcode queue back to normal [01:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:22] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:10:52] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.008 second response time [01:30:22] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [01:31:57] going to sneak in a late SWAT patch for a problem i just found [01:31:59] (03CR) 10EBernhardson: [C: 032] Setup sister search prefix display types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334721 (https://phabricator.wikimedia.org/T149806) (owner: 10EBernhardson) [01:34:52] (03PS2) 10EBernhardson: Setup sister search prefix display types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334721 (https://phabricator.wikimedia.org/T149806) [01:34:58] (03CR) 10EBernhardson: [C: 032] Setup sister search prefix display types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334721 (https://phabricator.wikimedia.org/T149806) (owner: 10EBernhardson) [01:37:05] (03Merged) 10jenkins-bot: Setup sister search prefix display types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334721 (https://phabricator.wikimedia.org/T149806) (owner: 10EBernhardson) [01:37:13] (03CR) 10jenkins-bot: Setup sister search prefix display types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334721 (https://phabricator.wikimedia.org/T149806) (owner: 10EBernhardson) [01:43:11] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: Setup sister search prefix display types T149806 (duration: 00m 48s) [01:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:15] T149806: [A/B/C Test] Add cross-wiki search results in a right sidebar - https://phabricator.wikimedia.org/T149806 [01:44:47] (03PS1) 10EBernhardson: InterwikiPrefixContentTypes -> InterwikiPrefixDisplayTypes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336950 [01:45:15] (03CR) 10EBernhardson: [C: 032] 
InterwikiPrefixContentTypes -> InterwikiPrefixDisplayTypes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336950 (owner: 10EBernhardson) [01:46:44] (03Merged) 10jenkins-bot: InterwikiPrefixContentTypes -> InterwikiPrefixDisplayTypes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336950 (owner: 10EBernhardson) [01:46:52] (03CR) 10jenkins-bot: InterwikiPrefixContentTypes -> InterwikiPrefixDisplayTypes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336950 (owner: 10EBernhardson) [01:48:11] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: Setup sister search prefix display types T149806 (duration: 00m 40s) [01:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:33] (03CR) 10Dzahn: "so this works only for new hosts, it does not work to remove hosts from icinga that are already in it :/" [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [01:56:01] (03PS1) 10Faidon Liambotis: apt: config a proxy for mirantis only where needed [puppet] - 10https://gerrit.wikimedia.org/r/336952 [01:57:47] (03CR) 10Andrew Bogott: [C: 032] "Of course this is better :)" [puppet] - 10https://gerrit.wikimedia.org/r/336952 (owner: 10Faidon Liambotis) [01:57:55] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#3015605 (10Dzahn) 05Open>03Resolved @hashar Thank you for this. Last night i was thinking about this and i was planning to write almost the same... 
[01:58:17] ^ look at that epic task being closed [01:58:59] we basically do all lint checks now, without exceptions and no warnings or errors except a few ignored ones [02:02:23] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#3015609 (10Dzahn) What can still be done here optionally is remove "lint::ignore" lines and fix the issues that they are ignoring. ``` manifests/si... [02:03:59] (03PS1) 10Dzahn: Revert "icinga/base: test skipping icinga monitoring for mc2016 by host" [puppet] - 10https://gerrit.wikimedia.org/r/336953 [02:04:15] (03PS1) 10Andrew Bogott: WIP Apt: Remove an ensure->absent stanza [puppet] - 10https://gerrit.wikimedia.org/r/336954 [02:04:29] (03CR) 10Dzahn: "was just a test, this is applied via role::spare, but it does not actively remove existing monitoring. it would only skip adding it." [puppet] - 10https://gerrit.wikimedia.org/r/336953 (owner: 10Dzahn) [02:05:18] (03CR) 10Dzahn: [C: 032] Revert "icinga/base: test skipping icinga monitoring for mc2016 by host" [puppet] - 10https://gerrit.wikimedia.org/r/336953 (owner: 10Dzahn) [02:05:25] (03PS2) 10Dzahn: Revert "icinga/base: test skipping icinga monitoring for mc2016 by host" [puppet] - 10https://gerrit.wikimedia.org/r/336953 [02:05:33] (03CR) 10Dzahn: [V: 032 C: 032] Revert "icinga/base: test skipping icinga monitoring for mc2016 by host" [puppet] - 10https://gerrit.wikimedia.org/r/336953 (owner: 10Dzahn) [02:06:07] (03CR) 10Andrew Bogott: [C: 04-2] "do not merge for a few weeks!" [puppet] - 10https://gerrit.wikimedia.org/r/336954 (owner: 10Andrew Bogott) [02:07:02] looks at why Icinga config is broken .. my revert should fix that too [02:08:47] 06Operations: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3015611 (10Dzahn) This doesn't work to remove hosts from Icinga that are already in it.. it only does for new hosts that have never been added... 
More changes would be needed to base::monitoring::host to also make... [02:19:11] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5407/" [puppet] - 10https://gerrit.wikimedia.org/r/318442 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [02:22:01] hmm. where did the "Feed" link in Phabricator go? [02:22:54] found it but disappeared from default menu. (it can be configured to come back) [02:26:57] 06Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3015614 (10Dzahn) [02:27:22] i wish there was a feature to tell phabricator "remind me of this task in 6 months" [02:27:28] without having to take it [02:28:09] (03PS2) 10Dzahn: icinga: remove pre-jessie conditional from monitoring::group [puppet] - 10https://gerrit.wikimedia.org/r/318442 (https://phabricator.wikimedia.org/T125023) [02:31:59] 06Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3015662 (10Dzahn) [02:32:34] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.11) (duration: 11m 53s) [02:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:03] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5408/" [puppet] - 10https://gerrit.wikimedia.org/r/336807 (https://phabricator.wikimedia.org/T150936) (owner: 10Hashar) [02:37:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Feb 10 02:37:59 UTC 2017 (duration 5m 26s) [02:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:02] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:55:08] (03PS1) 10Dzahn: icinga/base: revert skipping base monitoring for role::spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/336956 (https://phabricator.wikimedia.org/T151632) [02:55:45] (03CR) 10Dzahn: [V: 032 C: 032] icinga/base: revert skipping base monitoring for role::spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/336956 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [02:55:52] (03PS2) 10Dzahn: icinga/base: revert skipping base monitoring for role::spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/336956 (https://phabricator.wikimedia.org/T151632) [02:59:02] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:13:02] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [03:19:43] (03PS7) 10Dzahn: Add zuul-merger on contint1001 and contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/336807 (https://phabricator.wikimedia.org/T150936) (owner: 10Hashar) [03:21:32] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:22:36] (03CR) 10Dzahn: [C: 032] Add zuul-merger on contint1001 and contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/336807 (https://phabricator.wikimedia.org/T150936) (owner: 10Hashar) [03:23:12] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 703.78 seconds [03:23:32] (03CR) 10Tim Landscheidt: "@chasemp: No problem, everything's fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/336351 (https://phabricator.wikimedia.org/T157400) (owner: 10Tim Landscheidt) [03:26:12] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 225.90 seconds [03:26:42] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 33 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/zuul/.ssh/id_rsa] [03:28:02] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [03:31:02] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:36:23] (03PS1) 10Dzahn: zuul: ensure /var/lib/zuul/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/336958 (https://phabricator.wikimedia.org/T140297) [03:37:05] (03PS2) 10Dzahn: zuul: ensure /var/lib/zuul/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/336958 (https://phabricator.wikimedia.org/T140297) [03:38:52] (03PS3) 10Dzahn: zuul: ensure /var/lib/zuul/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/336958 (https://phabricator.wikimedia.org/T140297) [03:39:13] ACKNOWLEDGEMENT - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/var/lib/zuul/.ssh/id_rsa] daniel_zahn https://gerrit.wikimedia.org/r/#/c/336807/ [03:42:20] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5409/" [puppet] - 10https://gerrit.wikimedia.org/r/336958 (https://phabricator.wikimedia.org/T140297) (owner: 10Dzahn) [03:42:28] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/336958/" [puppet] - 10https://gerrit.wikimedia.org/r/336807 (https://phabricator.wikimedia.org/T150936) (owner: 10Hashar) [03:42:37] (03CR) 10Dzahn: [C: 032] zuul: ensure /var/lib/zuul/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/336958 (https://phabricator.wikimedia.org/T140297) (owner: 10Dzahn) [03:42:45] (03PS4) 10Dzahn: zuul: ensure /var/lib/zuul/.ssh exists [puppet] - 10https://gerrit.wikimedia.org/r/336958 (https://phabricator.wikimedia.org/T140297) [03:48:36] !log icinga - live hack fixing config - due to partially removed decom hosts mc2001-mc2016 [03:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:37] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [03:50:47] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [03:50:57] (03CR) 10Dzahn: "Info: Caching catalog for contint2001.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/336958 (https://phabricator.wikimedia.org/T140297) (owner: 10Dzahn) [03:51:57] PROBLEM - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user [03:52:57] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [03:53:10] ACKNOWLEDGEMENT - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user daniel_zahn https://gerrit.wikimedia.org/r/#/c/336807/ [03:53:28] 
ACKNOWLEDGEMENT - git_daemon_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user daniel_zahn https://gerrit.wikimedia.org/r/#/c/336807/ [04:00:07] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [04:02:44] (03CR) 10Dzahn: [C: 032] ganglia: switch codfw aggregator to install2002 [puppet] - 10https://gerrit.wikimedia.org/r/336362 (owner: 10Dzahn) [04:03:04] (03PS2) 10Dzahn: ganglia: switch eqiad aggregator to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/336361 [04:03:27] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:04:17] (03CR) 10Dzahn: [C: 032] ganglia: switch eqiad aggregator to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/336361 (owner: 10Dzahn) [04:05:45] (03PS2) 10Dzahn: ganglia: switch codfw aggregator to install2002 [puppet] - 10https://gerrit.wikimedia.org/r/336362 [04:06:30] !log ganglia - switching aggregators from 1001 to 1002 and 2001 to 2002, there might be minor gaps in the graphs, but hey, it's deprecated anyways [04:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:37] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=445.90 Read Requests/Sec=893.90 Write Requests/Sec=5.20 KBytes Read/Sec=44338.00 KBytes_Written/Sec=1298.40 [04:23:33] (03PS1) 10Dzahn: install/TFTP: use install1002 and install2002 as next-servers [puppet] - 10https://gerrit.wikimedia.org/r/336959 [04:23:37] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=94.20 Read Requests/Sec=65.70 Write Requests/Sec=4.90 KBytes Read/Sec=632.00 KBytes_Written/Sec=312.80 [04:23:46] (03CR) 10jerkins-bot: [V: 04-1] install/TFTP: use install1002 and install2002 as next-servers [puppet] - 10https://gerrit.wikimedia.org/r/336959 (owner: 10Dzahn) [04:24:55] 
arrrrggg [04:25:04] jerkins-bot [04:25:23] is broken due to adding the new zuul-mergers [04:25:45] now it tries to connect to contint1001 instead of scandium and it fails with Connection refused [04:31:27] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [04:40:48] (03PS1) 10Dzahn: zuul: add contint1001/2001 to zuul merger hosts for ferm [puppet] - 10https://gerrit.wikimedia.org/r/336961 (https://phabricator.wikimedia.org/T150936) [04:42:17] (03CR) 10Dzahn: [C: 032] zuul: add contint1001/2001 to zuul merger hosts for ferm [puppet] - 10https://gerrit.wikimedia.org/r/336961 (https://phabricator.wikimedia.org/T150936) (owner: 10Dzahn) [04:43:44] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/336959 (owner: 10Dzahn) [04:47:54] (03CR) 10Dzahn: "follow-up 2: https://gerrit.wikimedia.org/r/#/c/336961/" [puppet] - 10https://gerrit.wikimedia.org/r/336807 (https://phabricator.wikimedia.org/T150936) (owner: 10Hashar) [05:03:37] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:31:37] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:37] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:00:34] (03PS1) 10Marostegui: beta.my.cnf: Make beta mysql prompt like prod [puppet] - 10https://gerrit.wikimedia.org/r/336964 (https://phabricator.wikimedia.org/T157714) [07:04:03] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3015955 (10Marostegui) Thanks @Papaul for handling all this. I will get the server ready for you on Monday again [07:13:57] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:14:37] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:20:07] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2378 [07:22:20] 06Operations, 13Patch-For-Review, 15User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3015971 (10elukey) [07:22:23] 06Operations, 10Traffic, 13Patch-For-Review, 15User-Elukey: prometheus-vhtcpd-stats cronspamming if vhtcpd is not running yet - https://phabricator.wikimedia.org/T157353#3015970 (10elukey) 05Open>03Resolved [07:25:07] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 916552 Threads: 1 Questions: 16214284 Slow queries: 4793 Opens: 7261 Flush tables: 1 Open tables: 571 Queries per second avg: 17.690 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [07:30:25] (03Abandoned) 10Giuseppe Lavagetto: Use etcd_index for the initial replication index in dump and load [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/336641 (owner: 10Giuseppe Lavagetto) [07:32:07] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:38:27] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:40:26] (03PS2) 10Giuseppe Lavagetto: profile::etcd::replication: refactor to make failover easier [puppet] - 10https://gerrit.wikimedia.org/r/336850 (https://phabricator.wikimedia.org/T156009) [07:42:16] (03CR) 10jerkins-bot: [V: 04-1] profile::etcd::replication: refactor to make failover easier [puppet] - 10https://gerrit.wikimedia.org/r/336850 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [07:42:57] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:43:37] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:44:55] 06Operations, 10ops-esams, 10hardware-requests, 13Patch-For-Review: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#3015990 (10MoritzMuehlenhoff) Hmm, cp3014,cp3020,cp3022 are still listed in https://servermon.wikimedia.org/hosts/, though. No idea why, let's wait for @akosiaris to... [07:45:07] (03PS3) 10Giuseppe Lavagetto: profile::etcd::replication: refactor to make failover easier [puppet] - 10https://gerrit.wikimedia.org/r/336850 (https://phabricator.wikimedia.org/T156009) [07:45:27] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:46:01] (03CR) 10jerkins-bot: [V: 04-1] profile::etcd::replication: refactor to make failover easier [puppet] - 10https://gerrit.wikimedia.org/r/336850 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [07:46:18] <_joe_> I am struggling to understand this -1 from jerkins-bot [07:46:57] <_joe_> oh, uhm, I see... 
[07:47:07] _joe_: there is also this: [07:47:20] 04:25 < mutante> jerkins-bot [07:47:28] 04:25 < mutante> is broken due to adding the new zuul-mergers [07:47:38] might be related... [07:47:43] <_joe_> marostegui: nope [07:47:50] ok :) [07:47:51] <_joe_> it's a pebkac [07:47:59] XDDD [07:48:39] <_joe_> it's still a parser fail ofc [07:50:11] (03PS4) 10Giuseppe Lavagetto: profile::etcd::replication: refactor to make failover easier [puppet] - 10https://gerrit.wikimedia.org/r/336850 (https://phabricator.wikimedia.org/T156009) [07:51:03] (03CR) 10jerkins-bot: [V: 04-1] profile::etcd::replication: refactor to make failover easier [puppet] - 10https://gerrit.wikimedia.org/r/336850 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [07:51:43] <_joe_> again? uhm [07:52:16] <_joe_> /o\ I give up on you again, puppet [07:52:33] (03PS1) 10Muehlenhoff: Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/336970 [07:53:37] (03PS2) 10Muehlenhoff: Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/336970 [07:54:30] (03PS5) 10Giuseppe Lavagetto: profile::etcd::replication: refactor to make failover easier [puppet] - 10https://gerrit.wikimedia.org/r/336850 (https://phabricator.wikimedia.org/T156009) [07:55:34] (03CR) 10Muehlenhoff: [C: 032] Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/336970 (owner: 10Muehlenhoff) [08:00:07] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [08:11:37] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:16:28] (03PS6) 10Giuseppe Lavagetto: profile::etcd::replication: refactor to make failover easier [puppet] - 10https://gerrit.wikimedia.org/r/336850 (https://phabricator.wikimedia.org/T156009) [08:22:02] (03PS1) 10Elukey: [WIP] Replace mc1001 with mc1019 [puppet] - 
10https://gerrit.wikimedia.org/r/336972 [08:23:09] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::replication: refactor to make failover easier [puppet] - 10https://gerrit.wikimedia.org/r/336850 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [08:30:39] !log Deploye alter table s3 officewiki.echo_notification and mediawikiwiki.echo_notification tables only on codfw - T136428 [08:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:44] T136428: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428 [08:35:49] (03PS2) 10Gehel: wdqs - icinga process check more relaxed on arguments [puppet] - 10https://gerrit.wikimedia.org/r/336878 [08:36:50] (03CR) 10Gehel: [C: 032] wdqs - icinga process check more relaxed on arguments [puppet] - 10https://gerrit.wikimedia.org/r/336878 (owner: 10Gehel) [08:38:27] RECOVERY - Blazegraph process on wdqs1003 is OK: PROCS OK: 1 process with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*war [08:39:37] RECOVERY - Blazegraph process on wdqs2003 is OK: PROCS OK: 1 process with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*war [08:40:34] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I do not think this patch is salvageable:" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/333358 (owner: 10Paladox) [08:41:14] !log restarting kafka mirror maker and jmxtrans of kafka[12]00[123] for java security upgrades [08:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:41] (03PS1) 10Gehel: elasticsearch - reimage elastic20(25|26|27|28) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336973 (https://phabricator.wikimedia.org/T151326) [08:44:53] !log upgrading hhvm on mw1200-mw1229 [08:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:28] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: 
name=elastic20(25|26|27|28).codfw.wmnet [08:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:37] (03PS2) 10Gehel: elasticsearch - reimage elastic20(25|26|27|28) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336973 (https://phabricator.wikimedia.org/T151326) [08:48:13] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: (null) [08:51:51] _joe_ --^ [08:52:21] (as fyi, I know that you are still working on it :) [08:52:46] Bytes sent by scb nodes has been reduced since the latest deployment of ores which returns minified version [08:52:47] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Service%20Cluster%20B%20eqiad&h=scb1001.eqiad.wmnet&r=day&z=default&jr=&js=&st=1486711183&event=hide&ts=0&v=527603.88&m=bytes_out&vl=bytes%2Fsec&ti=Bytes%20Sent&z=large [08:54:03] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic20(25|26|27|28) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336973 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [08:54:43] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:55:02] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info - https://phabricator.wikimedia.org/T157714#3016084 (10jcrespo) p:05Triage>03Low Let's get the ok from beta owners on any changes. [08:58:09] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016086 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2025.codfw.wmnet'] ```... 
[08:58:48] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016087 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2026.codfw.wmnet'] ```... [08:59:24] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016088 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2028.codfw.wmnet'] ```... [09:00:31] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016089 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2027.codfw.wmnet'] ```... [09:00:35] <_joe_> elukey: yup :) [09:01:09] <_joe_> Amir1: \o/ [09:01:43] (03CR) 10Hashar: [C: 031] "Please do! I wish MySQL/MariaDB made that the default prompt :-}" [puppet] - 10https://gerrit.wikimedia.org/r/336964 (https://phabricator.wikimedia.org/T157714) (owner: 10Marostegui) [09:01:54] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info - https://phabricator.wikimedia.org/T157714#3014147 (10hashar) antoine-approve [09:01:59] (03PS3) 10Ema: varnish: remove ganglia python module [puppet] - 10https://gerrit.wikimedia.org/r/336778 [09:02:05] (03CR) 10Ema: [V: 032 C: 032] varnish: remove ganglia python module [puppet] - 10https://gerrit.wikimedia.org/r/336778 (owner: 10Ema) [09:02:49] :) [09:03:03] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:09:11] hashar: you got 2 new zuul mergers. 
2 gerrit follow-ups and some phab comments:) cheers and good day. out [09:11:51] Amir1: nice! https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=scb1001&from=now-7d&to=now [09:12:14] I see that other metrics improved as well [09:12:52] * elukey promotes Prometheus metrics [09:14:33] :D [09:14:33] mutante: noticed that :} [09:14:47] mutante: that is quite awesome thank you very much for that [09:15:10] mutante: processing the backlog and I will look at the follow up patches to clear out the icinga alarms [09:15:14] elukey: those ones were related to stopping the bot that was using ORES extensively [09:15:58] hashar: :) welcome! and also.. the comment and achievement on closing that _EPIC_ ticket about linting and puppet style. cheers to that. [09:16:29] mutante: next step is adding rspec tests :] [09:16:31] i remember when the number of warnigns and errors was more than the number of lines. lol. and now it's 0 [09:17:12] hashar: I'll send you some Swiss chocolate when you get those rspec to run! [09:17:12] ok, i gotta learn [09:18:26] signs out [09:18:35] mutante: and you managed to follow up with everything. Danke!!! have a good night [09:18:43] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [09:18:44] gehel: you can already pack those chocolates in a box :D [09:19:03] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:22:43] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:24:20] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016219 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2026.codfw.wmnet'] ``` and were **ALL** successful. [09:26:37] !log stopped zuul-merger process on contint1001 and contint2001. 
They lack the git-daemon service to expose the merges. [09:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:50] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016237 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2025.codfw.wmnet'] ``` and were **ALL** successful. [09:28:07] PROBLEM - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [09:29:17] PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [09:29:28] ^^^ that is me [09:29:32] stopped them on purpose [09:29:40] until I manage to get the git-daemon process to spawn as well [09:31:07] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [09:31:26] rhgh puppet [09:35:07] PROBLEM - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [09:37:26] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: fix monitoring port for icinga [puppet] - 10https://gerrit.wikimedia.org/r/336975 [09:38:24] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd::replication: fix monitoring port for icinga [puppet] - 10https://gerrit.wikimedia.org/r/336975 (owner: 10Giuseppe Lavagetto) [09:42:07] RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon --user [09:43:07] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:47:17] (03CR) 10Reedy: [C: 031] beta.my.cnf: Make beta mysql prompt like prod [puppet] - 10https://gerrit.wikimedia.org/r/336964 (https://phabricator.wikimedia.org/T157714) (owner: 10Marostegui) [09:51:44] (03PS1) 10ArielGlenn: temp removal of wansec.com mirror from list, dns issues [puppet] - 10https://gerrit.wikimedia.org/r/336976 [09:51:53] !log Reenabling puppet and zuul-merger on contint1001 and contint2001. The git-daemon is running now T140297 T150936. The 'systemctl status git-daemon' thought that the service was running when it was not (filled T157785 ) [09:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:00] T157785: zuul-merger git-daemon process is not start properly by systemd ? - https://phabricator.wikimedia.org/T157785 [09:52:00] T150936: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936 [09:52:14] 06Operations, 10Pybal, 10Traffic, 15User-Joe: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759#2922958 (10ema) Pybal not failing over to the next DNS server in resolv.conf has been mentioned in T83662 as well. 
[09:53:07] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [09:53:17] RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [09:54:04] (03PS2) 10ArielGlenn: temp removal of wansec.com mirror from list, dns issues [puppet] - 10https://gerrit.wikimedia.org/r/336976 [09:55:29] (03CR) 10ArielGlenn: [C: 032] temp removal of wansec.com mirror from list, dns issues [puppet] - 10https://gerrit.wikimedia.org/r/336976 (owner: 10ArielGlenn) [09:56:57] RECOVERY - Check systemd state on dataset1001 is OK: OK - running: The system is fully operational [09:57:09] 06Operations, 10Pybal, 10Traffic: Unhandled error stopping pybal: 'RunCommandMonitoringProtocol' object has no attribute 'checkCall' - https://phabricator.wikimedia.org/T157786#3016326 (10ema) [09:57:21] 06Operations, 10Pybal, 10Traffic: Unhandled error stopping pybal: 'RunCommandMonitoringProtocol' object has no attribute 'checkCall' - https://phabricator.wikimedia.org/T157786#3016339 (10ema) p:05Triage>03Normal [10:00:57] RECOVERY - Check systemd state on ms1001 is OK: OK - running: The system is fully operational [10:03:43] !log rebooting contint2001 [10:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:37] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:06:51] (03PS1) 10Hashar: contint: git-daemon service is 'sysvinit' [puppet] - 10https://gerrit.wikimedia.org/r/336978 (https://phabricator.wikimedia.org/T157785) [10:06:53] !log roll-restart restbase in codfw to pick up new statsd.eqiad.wmnet - T157022 [10:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:59] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [10:07:24] (03PS2) 10Giuseppe Lavagetto: prometheus::class_config: allow new selections for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/336851 [10:07:37] PROBLEM - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [10:08:16] (03CR) 10Volans: "see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/336851 (owner: 10Giuseppe Lavagetto) [10:08:46] _joe_: this is for the previous version... you just uploaded while I was replying :D [10:09:29] <_joe_> volans: nope, targets.length is not always true [10:09:40] <_joe_> targets.lenght can be 0 [10:09:46] <_joe_> it easily is, actually [10:09:55] and if 0 is true in ruby [10:10:31] <_joe_> oh right [10:10:34] !log lvs1007-10: upgrade to jessie 8.7, pybal 1.13.4, reboot into kernel 4.4.2-3+wmf8 T155401 [10:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:38] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [10:20:29] !log restart of jmxtrans on analytics by elukey - T157022 [10:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:36] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [10:21:18] (03CR) 10Volans: [C: 031] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 (owner: 10Krinkle) [10:21:35] _joe_: could you take a look too? 
^^^ [10:30:37] RECOVERY - jenkins_service_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [10:30:39] !log roll-restart restbase in eqiad to pick up new statsd.eqiad.wmnet - T157022 [10:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:43] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [10:32:38] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:34:01] !log roll-restart ocg to pick up new statsd.eqiad.wmnet - T157022 [10:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:14] !log roll-restart of aqs to pick up new statsd.eqiad.wmnet - T157022 [10:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:19] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [10:39:18] !log roll-restart parsoid in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022 [10:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:11] 06Operations, 10OfflineContentGenerator, 06Reading-Web-Backlog, 06Services (watching): Confirm attribution needs - https://phabricator.wikimedia.org/T150875#3016456 (10Moushira) [10:54:44] !log roll-restart karthoterian in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022 [10:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:49] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [10:55:33] !Deploy alter table s3 officewiki.echo_notification and mediawikiwiki.echo_notification tables only on eqiad - https://phabricator.wikimedia.org/T136428 [10:57:28] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM in general, one minor comment on code organization." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [10:58:48] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2802081 (10Tobi_WMDE_SW) 05Open>03Resolved [10:58:49] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#3016495 (10Tobi_WMDE_SW) [10:59:28] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "cool, one error to fix though." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333247 (owner: 10Filippo Giunchedi) [11:02:59] (03CR) 10Jcrespo: [C: 032] "I can deploy this right now, no problem. But note that connnecting from localhost will show user@localhost [db]> rather than the hostname." [puppet] - 10https://gerrit.wikimedia.org/r/336964 (https://phabricator.wikimedia.org/T157714) (owner: 10Marostegui) [11:03:07] (03PS2) 10Jcrespo: beta.my.cnf: Make beta mysql prompt like prod [puppet] - 10https://gerrit.wikimedia.org/r/336964 (https://phabricator.wikimedia.org/T157714) (owner: 10Marostegui) [11:03:58] (03CR) 10Jcrespo: [C: 032] "Also, this is the prompt for the server, which will not affect external clients, so my previous comment will apply always." [puppet] - 10https://gerrit.wikimedia.org/r/336964 (https://phabricator.wikimedia.org/T157714) (owner: 10Marostegui) [11:06:26] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info - https://phabricator.wikimedia.org/T157714#3016508 (10jcrespo) I have merged the above patch as requested, but as I commented there, I do not think that will solve the tic... 
[11:06:39] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3016509 (10jcrespo) [11:07:48] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:08:18] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [11:08:58] PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:09:18] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [11:09:48] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [11:10:01] !log restart navtiming ve asset-check statsd-mw-js-deprecate on hafnium to pick up statsd.eqiad.wmnet change - T157022 [11:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:06] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [11:11:19] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3014147 (10Marostegui) We can always add it to the `[client]` section too [11:12:46] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3016526 (10jcrespo) >>! In T157714#3016090, @hashar wrote: > antoine-approve photo You are aging well, you look younge... 
[11:15:12] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016542 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2028.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic202... [11:15:26] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3016543 (10jcrespo) >>! In T157714#3016524, @Marostegui wrote: > We can always add it to the `[client]` section too No... [11:16:21] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016544 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2027.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic202... [11:16:53] !log roll-restart tilerator in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022 [11:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:57] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [11:18:24] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3016547 (10hashar) > You are aging well, you look younger now than on the profile photo. I am actually younger on the... 
[11:19:12] !log roll-restart jmxtrans on conf* in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022 [11:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:00] !log roll-restart parsoid on ruthenium to pick up new statsd.eqiad.wmnet - T157022 [11:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:04] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [11:23:08] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:25:16] !log roll-restart nodepool on labnodepool1001 to pick up new statsd.eqiad.wmnet - T157022 [11:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:50] hashar: ^ FYI, in case it is somehow distructive [11:26:48] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [11:27:48] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3464789 keys, up 102 days 3 hours - replication_delay is 0 [11:30:13] !log roll-restart changeprop on scb in eqiad/codfw to pick up new statsd.eqiad.wmnet - T157022 [11:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:17] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [11:33:46] (03PS1) 10Jcrespo: admin-jynus: Update my alias to show static hostaname [puppet] - 10https://gerrit.wikimedia.org/r/336987 [11:36:21] !log roll-restart mathoid/citoid/mobileapps/cxserver/eventstreams/graphoid on scb in eqiad/codfw to pick up new statsd.eqiad.wmnet - T157022 [11:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:26] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [11:36:32] (03CR) 
10Giuseppe Lavagetto: prometheus::class_config: allow new selections for prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/336851 (owner: 10Giuseppe Lavagetto) [11:36:50] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3016567 (10jcrespo) Have a look on how I handle it on production with an alias- enforce the host staticly rathen than \... [11:36:58] RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [11:38:31] (03PS3) 10Giuseppe Lavagetto: prometheus::class_config: allow new selections for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/336851 [11:41:20] !log roll-restart trendingedits on scb in eqiad/codfw to pick up new statsd.eqiad.wmnet - T157022 [11:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:33] (03PS2) 10Jcrespo: admin-jynus: Update my alias to show static hostaname [puppet] - 10https://gerrit.wikimedia.org/r/336987 [11:42:49] (03CR) 10Jcrespo: [C: 032] admin-jynus: Update my alias to show static hostaname [puppet] - 10https://gerrit.wikimedia.org/r/336987 (owner: 10Jcrespo) [11:45:22] (03PS1) 10Muehlenhoff: Update to 4.4.48 [debs/linux44] - 10https://gerrit.wikimedia.org/r/336989 [11:45:59] (03CR) 10Jcrespo: [V: 032 C: 032] admin-jynus: Update my alias to show static hostaname [puppet] - 10https://gerrit.wikimedia.org/r/336987 (owner: 10Jcrespo) [11:46:29] !log roll-restart tileratorui in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022 [11:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:35] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [11:48:28] phew, that might be it to fully drain graphite1001 [11:49:40] (03PS1) 10Zhuyifei1999: toollabs: Preparing to move `/usr/local/bin/crontab` to 
labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) [11:50:38] (03CR) 10Zhuyifei1999: [C: 04-1] "Not yet. The patch on the other repo is not ready." [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [11:52:08] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [11:54:09] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: lag can be negative, quotes [puppet] - 10https://gerrit.wikimedia.org/r/336991 [11:55:30] (03PS2) 10Giuseppe Lavagetto: profile::etcd::replication: lag can be negative, quotes [puppet] - 10https://gerrit.wikimedia.org/r/336991 [11:56:29] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd::replication: lag can be negative, quotes [puppet] - 10https://gerrit.wikimedia.org/r/336991 (owner: 10Giuseppe Lavagetto) [11:58:23] hi [11:59:10] 06Operations, 10Pybal, 10Traffic: lvs servers report 'Memory allocation problem' on bootup - https://phabricator.wikimedia.org/T82849#3016610 (10ema) 05Resolved>03Open [12:00:10] !log bounce mwerrors on eventlog1001 to pick up statsd cname change - T157022 [12:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:15] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [12:00:44] this page vandalised [12:00:45] https://en.wikipedia.org/w/index.php?title=GNU_Lesser_General_Public_License&type=revision&diff=761894199&oldid=754101653 [12:00:57] please get it back [12:02:09] 06Operations, 10Pybal, 10Traffic: lvs servers report 'Memory allocation problem' on bootup - https://phabricator.wikimedia.org/T82849#3016616 (10ema) This is still happening. @chasemp mentioned in T113597 that the error (from ipvsadm) can be reproduced referencing a pool that doesn't exist. [12:04:19] (03CR) 10Volans: [C: 04-1] "Nice. 
I think is better now with all the data gathering in the .pp instead of the templates ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/336851 (owner: 10Giuseppe Lavagetto) [12:07:32] (03CR) 10Giuseppe Lavagetto: prometheus::class_config: allow new selections for prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/336851 (owner: 10Giuseppe Lavagetto) [12:09:02] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#3016629 (10fgiunchedi) >>! In T157022#3002347, @Cmjohnson wrote: > @fgiunchedi I have the ssds on-site. The disk is in a 3.5" internal disk bay and will need to be powered off for t... [12:09:55] (03PS4) 10Filippo Giunchedi: graphite: move alerts to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335763 (https://phabricator.wikimedia.org/T157022) [12:10:11] (03PS1) 10Muehlenhoff: Extended MOUs for ISI researchers [puppet] - 10https://gerrit.wikimedia.org/r/336992 [12:10:13] <_joe_> volans: see my comment [12:10:44] just saw it, I was searching for the puppetdb.resources() source code :D [12:11:04] given that grouphosts is passed as is by the query_resource() [12:11:14] s/resource/resources/ [12:11:44] !log updating firewall rules for analytics on cr1/cr2 [12:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:28] (03CR) 10Muehlenhoff: [C: 032] Extended MOUs for ISI researchers [puppet] - 10https://gerrit.wikimedia.org/r/336992 (owner: 10Muehlenhoff) [12:15:30] (03PS1) 10Muehlenhoff: Record extended NDA/contract dates for CPS frtech consultants [puppet] - 10https://gerrit.wikimedia.org/r/336994 [12:15:36] 06Operations: diamond crashing on hosts using systemd-timesyncd - https://phabricator.wikimedia.org/T157794#3016635 (10ema) [12:16:14] 06Operations, 10Monitoring, 10Traffic: diamond crashing on hosts using systemd-timesyncd - https://phabricator.wikimedia.org/T157794#3016651 (10ema) [12:22:12] 06Operations, 
10Monitoring, 10Traffic: diamond crashing on hosts using systemd-timesyncd - https://phabricator.wikimedia.org/T157794#3016652 (10fgiunchedi) > Should we remove /usr/share/diamond/collectors/ntpd/ if systemd-timesyncd is in use? If that isn't too messy on the puppet level I think it'd make sen... [12:25:21] (03CR) 10Muehlenhoff: [C: 032] Record extended NDA/contract dates for CPS frtech consultants [puppet] - 10https://gerrit.wikimedia.org/r/336994 (owner: 10Muehlenhoff) [12:27:40] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:29:55] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3016653 (10elukey) Added kafka2003, fixed Archiva. [12:32:22] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,service=nginx,name=mw1229.eqiad.wmnet [12:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:59] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,service=nginx,name=mw1228.eqiad.wmnet [12:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:15] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,service=nginx,name=mw1227.eqiad.wmnet [12:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:36] 06Operations: /etc/localtime should be a symbolic link - https://phabricator.wikimedia.org/T157795#3016655 (10ema) [12:36:31] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3016675 (10elukey) Other batches: Fix logstash IPs: ``` set firewall family inet filter analytics-in4 term logstash from destination-address 10.64.0.122 set firewall family inet filter... 
[12:45:20] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:52:42] 06Operations, 10Wikimedia-Shop, 07Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#3016683 (10Aklapper) a:05Stype_and_Co.-WMF>03None [12:54:16] (03CR) 10Zhuyifei1999: "I32cdbfbf2955c8cc6a020968cebd78b458139a08" [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [12:55:31] (03PS2) 10Zhuyifei1999: toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) [12:55:40] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:56:20] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: (null) [12:57:22] 06Operations: systemd-timedated starting up every minute - https://phabricator.wikimedia.org/T157797#3016701 (10ema) [13:01:43] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: attempt not to confuse icinga [puppet] - 10https://gerrit.wikimedia.org/r/337000 [13:02:15] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd::replication: attempt not to confuse icinga [puppet] - 10https://gerrit.wikimedia.org/r/337000 (owner: 10Giuseppe Lavagetto) [13:05:55] lol for the title [13:08:06] (03PS1) 10Ema: varnish: remove ganglia vhtcpd python module [puppet] - 10https://gerrit.wikimedia.org/r/337001 [13:13:21] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 149 bytes in 0.073 second response time [13:13:21] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [13:14:27] (03PS1) 10Ema: varnish: remove varnish::monitoring::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/337002 [13:14:35] (03CR) 10jerkins-bot: 
[V: 04-1] varnish: remove varnish::monitoring::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/337002 (owner: 10Ema) [13:21:28] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3016747 (10elukey) Fixed logstash IPs, added install1002 (208.80.154.86/32) but not removed the other ones (for the moment). [13:23:21] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.130 second response time [13:30:15] (03PS1) 10Jcrespo: Add script to generate mysql client-only .deb package [software] - 10https://gerrit.wikimedia.org/r/337007 [13:33:54] (03PS1) 10Hashar: zuul: use a proper require for the merger class [puppet] - 10https://gerrit.wikimedia.org/r/337008 [13:37:06] (03PS2) 10Jcrespo: Add script to generate mysql client-only .deb package [software] - 10https://gerrit.wikimedia.org/r/337007 (https://phabricator.wikimedia.org/T157702) [13:44:57] (03CR) 10Hashar: [C: 031] "Puppet compiler https://puppet-compiler.wmflabs.org/5418/ looks fine. 
That is one less oddity :}" [puppet] - 10https://gerrit.wikimedia.org/r/337008 (owner: 10Hashar) [13:45:42] (03PS1) 10Muehlenhoff: Don't enable the Diamond ntpd collector if systemd-timesyncd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) [13:46:06] (03PS1) 10Elukey: Apply role memcached to the new mc1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/337010 (https://phabricator.wikimedia.org/T137345) [13:50:21] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.156 second response time [14:09:13] (03PS1) 10Hashar: admin: basic .vimrc for hashar [puppet] - 10https://gerrit.wikimedia.org/r/337014 [14:13:26] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3016904 (10jcrespo) @Marostegui Let's assume there is not blockers (which we do have) and make a full replacement plan. [14:16:13] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3016922 (10jcrespo) [14:16:15] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Remove db1002-db1007 from production - https://phabricator.wikimedia.org/T105768#3016921 (10jcrespo) [14:20:06] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016928 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2027.codfw.wmnet'] ```... [14:21:14] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3016929 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2028.codfw.wmnet'] ```... 
[14:24:11] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:24:28] (03PS1) 10Jcrespo: Install wmf-mariadb-client for client-only installs [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/337018 (https://phabricator.wikimedia.org/T157702) [14:26:20] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3016954 (10ovasileva) [14:26:44] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic20(25|26).codfw.wmnet [14:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:40] (03CR) 10Jcrespo: [C: 032] Install wmf-mariadb-client for client-only installs [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/337018 (https://phabricator.wikimedia.org/T157702) (owner: 10Jcrespo) [14:33:03] (03PS1) 10Jcrespo: Apply a3ded1b40909f9351 (mariadb client install) to production [puppet] - 10https://gerrit.wikimedia.org/r/337022 (https://phabricator.wikimedia.org/T157702) [14:35:13] (03CR) 10Filippo Giunchedi: Don't enable the Diamond ntpd collector if systemd-timesyncd is used (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff) [14:35:41] (03PS1) 10Hashar: Remove zuul-merger from scandium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/337023 (https://phabricator.wikimedia.org/T150936) [14:35:57] (03PS5) 10Filippo Giunchedi: graphite: move alerts to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335763 (https://phabricator.wikimedia.org/T157022) [14:37:12] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:38:29] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Phase out 
scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3016982 (10hashar) @Dzahn @RobH we no more need scandium.eqiad.wmnet. It was solely running the `zuul-merger` service which is now running on contint10... [14:38:45] (03CR) 10Jcrespo: [C: 031] "I'm rather confident to apply this, I've tested it manually by doing:" [puppet] - 10https://gerrit.wikimedia.org/r/337022 (https://phabricator.wikimedia.org/T157702) (owner: 10Jcrespo) [14:40:57] (03CR) 10Jcrespo: [C: 031] misc.my.cnf.erb: Enable barracuda and innodb_strict_mode [puppet] - 10https://gerrit.wikimedia.org/r/321638 (https://phabricator.wikimedia.org/T150949) (owner: 10Marostegui) [14:42:04] (03CR) 10Jcrespo: [C: 032] Introduce linters using rake [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/331329 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:44:21] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:45:53] (03PS1) 10Jcrespo: Deploy 3a09aee8dd90d8f to production (Introduce linters using rake) [puppet] - 10https://gerrit.wikimedia.org/r/337025 [14:46:08] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237#3017014 (10fgiunchedi) >>! In T157237#3005905, @Cmjohnson wrote: > I am assuming this is one of the ssds when I pull the pd list with megacli a ssd is missing. Please confirm. The system is out of warranty but w... [14:46:28] (03CR) 10Jcrespo: [C: 031] Deploy 3a09aee8dd90d8f to production (Introduce linters using rake) [puppet] - 10https://gerrit.wikimedia.org/r/337025 (owner: 10Jcrespo) [14:46:47] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017015 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2027.codfw.wmnet'] ``` and were **ALL** successful. 
[14:46:51] (03CR) 10Jcrespo: [C: 031] "Waiting for confirmation for deploy." [puppet] - 10https://gerrit.wikimedia.org/r/337025 (owner: 10Jcrespo) [14:48:04] (03PS4) 10Hashar: ores: Send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/321096 (https://phabricator.wikimedia.org/T149010) (owner: 10Ladsgroup) [14:48:18] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] graphite: move alerts to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335763 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [14:48:38] (03CR) 10Hashar: "rebased, fixed trivial conflicts and reapplied on beta cluster" [puppet] - 10https://gerrit.wikimedia.org/r/321096 (https://phabricator.wikimedia.org/T149010) (owner: 10Ladsgroup) [14:50:51] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017020 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2028.codfw.wmnet'] ```... [14:52:15] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3017025 (10hashar) Worth noting, I think most of us use the deployment server deployment-tin.eqiad.wmnet to connect to... [14:55:01] 06Operations, 10Beta-Cluster-Infrastructure, 10DBA, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3017032 (10jcrespo) Whatever you decide, I would be happy to help- this is is more of a client issue rather than server... 
[14:55:03] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [14:58:45] 06Operations, 10Analytics, 10Analytics-Cluster: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3017060 (10elukey) p:05Triage>03Normal [14:59:33] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3017069 (10Marostegui) >>! In T134476#3016904, @jcrespo wrote: > @Marostegui Let's assume there is not blockers (which we do have) and make a full replacement plan. Sounds good to me [14:59:44] RECOVERY - Check systemd state on ms-fe1007 is OK: OK - running: The system is fully operational [15:02:01] (03PS4) 10Giuseppe Lavagetto: prometheus::class_config: allow new selections for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/336851 [15:02:05] (03CR) 10Hashar: "recheck" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [15:02:12] (03CR) 10jerkins-bot: [V: 04-1] Resolve hanging mysql group with uid 1000 for new reimages [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [15:02:44] RECOVERY - Check systemd state on ms-fe1008 is OK: OK - running: The system is fully operational [15:04:22] (03PS2) 10Jcrespo: Resolve hanging mysql group with uid 1000 for new reimages [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/336800 (https://phabricator.wikimedia.org/T100501) [15:04:26] (03PS5) 10Filippo Giunchedi: Enable JMX exporter on RESTBase Staging nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [15:06:02] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Enable JMX exporter on RESTBase Staging nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) (owner: 
10Eevans) [15:09:43] !log bounce cassandra-a on xenon after https://gerrit.wikimedia.org/r/335826 [15:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:10] (03CR) 10Marostegui: [C: 031] Add script to generate mysql client-only .deb package [software] - 10https://gerrit.wikimedia.org/r/337007 (https://phabricator.wikimedia.org/T157702) (owner: 10Jcrespo) [15:11:10] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/336851 (owner: 10Giuseppe Lavagetto) [15:11:34] PROBLEM - cassandra-a service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:11:34] PROBLEM - Check systemd state on xenon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:11:44] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: connect to address 10.64.0.202 and port 9042: Connection refused [15:11:54] PROBLEM - cassandra-a SSL 10.64.0.202:7001 on xenon is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:12:25] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:12:35] expected ^ [15:12:44] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [15:12:44] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [15:12:44] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 44 ESP OK [15:12:54] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 44 ESP OK [15:12:54] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 44 ESP OK [15:12:54] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 44 ESP OK [15:12:54] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK [15:12:54] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 44 ESP OK [15:13:04] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [15:13:05] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [15:13:05] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 44 ESP OK 
[15:13:14] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [15:13:14] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [15:13:14] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [15:13:14] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 44 ESP OK [15:13:24] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 44 ESP OK [15:13:24] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [15:13:24] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 44 ESP OK [15:13:24] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 44 ESP OK [15:13:24] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 44 ESP OK [15:13:25] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [15:13:33] ^ this is me trying to see if there's hope for cp1052 T148891 [15:13:33] T148891: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891 [15:13:34] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [15:13:47] (03PS1) 10Muehlenhoff: Add account validation script / cron (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [15:14:14] 06Operations, 06Parsing-Team, 13Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#3017134 (10ssastry) This has turned into a rabbit hole but a good one :-) .. I've started updating node modules to newer versions, migrating code to use promises, a... 
[15:14:33] 06Operations, 06Parsing-Team, 13Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#3017135 (10ssastry) 05Open>03Resolved a:03ssastry [15:15:02] (03CR) 10jerkins-bot: [V: 04-1] Add account validation script / cron (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [15:15:35] !log temporarily disabling mariadb replication lag checks to deploy new version of the icinga check script [15:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:36] (03PS2) 10Jcrespo: Apply a3ded1b40909f9351 (mariadb client install) to production [puppet] - 10https://gerrit.wikimedia.org/r/337022 (https://phabricator.wikimedia.org/T157702) [15:17:34] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [15:17:34] RECOVERY - Check systemd state on xenon is OK: OK - running: The system is fully operational [15:17:44] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.000 second response time on 10.64.0.202 port 9042 [15:17:54] RECOVERY - cassandra-a SSL 10.64.0.202:7001 on xenon is OK: SSL OK - Certificate xenon-a valid until 2017-09-08 16:32:33 +0000 (expires in 210 days) [15:18:25] (03PS3) 10Jcrespo: Add script to generate mysql client-only .deb package [software] - 10https://gerrit.wikimedia.org/r/337007 (https://phabricator.wikimedia.org/T157702) [15:18:34] (03CR) 10Jcrespo: [C: 032] Apply a3ded1b40909f9351 (mariadb client install) to production [puppet] - 10https://gerrit.wikimedia.org/r/337022 (https://phabricator.wikimedia.org/T157702) (owner: 10Jcrespo) [15:19:41] (03PS2) 10Muehlenhoff: Add account validation script / cron (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [15:20:44] 06Operations, 06Parsing-Team, 13Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - 
https://phabricator.wikimedia.org/T156177#3017144 (10Volans) @ssastry: does this mean that https://gerrit.wikimedia.org/r/#/c/334452 can be reverted and restart the 2 services? [15:20:46] (03CR) 10jerkins-bot: [V: 04-1] Add account validation script / cron (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [15:21:01] (03PS2) 10Jcrespo: Deploy 3a09aee8dd90d8f to production (Introduce linters using rake) [puppet] - 10https://gerrit.wikimedia.org/r/337025 [15:21:27] interesting pathch, moritzm [15:21:39] I may add you to a user-related patch [15:21:43] if you are ok [15:22:16] sure! [15:22:24] no rush [15:22:31] (03PS1) 10Eevans: Fix broken path to Prometheus exporter config [puppet] - 10https://gerrit.wikimedia.org/r/337034 (https://phabricator.wikimedia.org/T155120) [15:22:52] mine is a long term marathon, but need input from the ones that really know stuff [15:23:37] (03CR) 10Jcrespo: [C: 032] Deploy 3a09aee8dd90d8f to production (Introduce linters using rake) [puppet] - 10https://gerrit.wikimedia.org/r/337025 (owner: 10Jcrespo) [15:24:03] (03PS2) 10Eevans: Fix broken path to Prometheus exporter config [puppet] - 10https://gerrit.wikimedia.org/r/337034 (https://phabricator.wikimedia.org/T155120) [15:24:12] (03CR) 10Giuseppe Lavagetto: [C: 032] prometheus::class_config: allow new selections for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/336851 (owner: 10Giuseppe Lavagetto) [15:24:25] (03PS5) 10Giuseppe Lavagetto: prometheus::class_config: allow new selections for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/336851 [15:24:31] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] prometheus::class_config: allow new selections for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/336851 (owner: 10Giuseppe Lavagetto) [15:25:46] 06Operations, 06Parsing-Team, 13Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#3017149 
(10ssastry) >>! In T156177#3017144, @Volans wrote: > @ssastry: does this mean that https://gerrit.wikimedia.org/r/#/c/334452 can be reverted and restart the... [15:26:24] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:28] subbu: ack, thanks ^^^ [15:26:54] ok. [15:27:27] (03CR) 10Eevans: "Puppet compiler output here: http://puppet-compiler.wmflabs.org/5422/xenon.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/337034 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [15:28:45] <_joe_> urandom: need help merging that? [15:29:34] <_joe_> seems sensible [15:33:12] 06Operations, 10ops-eqiad, 10Traffic: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891#3017160 (10ema) ``` [Fri Feb 10 15:01:57 2017] bnx2x 0000:01:00.0 eth0: Warning: Unqualified SFP+ module detected, Port 0 from FINISAR CORP. part number FTLX1471D3BCL [Fri Feb 10... [15:34:31] _joe_: i think godog has it [15:34:48] _joe_: it's a reaction to a change of mine he merged that was bugged :/ [15:34:57] * urandom facepalms [15:34:57] (03CR) 10Muehlenhoff: Don't enable the Diamond ntpd collector if systemd-timesyncd is used (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff) [15:35:14] yeah I'll merge that now [15:35:33] (03CR) 10Filippo Giunchedi: [C: 032] Fix broken path to Prometheus exporter config [puppet] - 10https://gerrit.wikimedia.org/r/337034 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [15:35:40] (03PS3) 10Filippo Giunchedi: Fix broken path to Prometheus exporter config [puppet] - 10https://gerrit.wikimedia.org/r/337034 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [15:36:02] _joe_: but thanks! [15:36:46] PROBLEM - Check systemd state on cp1052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:37:30] (03PS4) 10Jcrespo: Add script to generate mysql client-only .deb package [software] - 10https://gerrit.wikimedia.org/r/337007 (https://phabricator.wikimedia.org/T157702) [15:38:08] _joe_: your change is ok to merge? [15:38:34] <_joe_> godog: yes, it's a noop at this point btw [15:39:00] (03CR) 10Jcrespo: [C: 032] Add script to generate mysql client-only .deb package [software] - 10https://gerrit.wikimedia.org/r/337007 (https://phabricator.wikimedia.org/T157702) (owner: 10Jcrespo) [15:41:06] (03PS10) 10Hashar: puppet parse validate from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) [15:41:53] and the patch to run puppet parser validate / hiera check etc from rake is now ready to got ^^^^ ! [15:42:23] the last concern was running puppet-lint from the root of the repo would fail when submodules are checked out. But now all our submodules are passing puppet-lint / puppet parser validate and have CI jobs to enforce ice [15:42:23] s/ice/it [15:42:28] so that is ready to go \O/ [15:44:03] (03CR) 10Hashar: [V: 031 C: 031] "I verified locally that all syntax check work with submodules checked out. Can be tried locally with:" [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [15:46:40] 06Operations, 10Pybal, 10Traffic: lvs servers report 'Memory allocation problem' on bootup - https://phabricator.wikimedia.org/T82849#906173 (10BBlack) Yeah ipvsadm says "memory allocation problem" if you give it any kind of not-useful arguments (like delete a non-existent service, etc) [15:55:23] jynus: wanna land the final rake/puppet lint change now https://gerrit.wikimedia.org/r/331239 ? 
:) [15:55:26] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:56:21] one sec [15:56:26] busy right now [15:57:46] RECOVERY - Check systemd state on cp1052 is OK: OK - running: The system is fully operational [15:58:36] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:03:47] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3017229 (10ArielGlenn) Summary: ---------------------- 35T swift or equivalent (before replication) 30T (after raid) labs box for nfs mounts to labs, statst100* hosts. Should be... [16:05:47] (03PS1) 10Thcipriani: Beta: Add prometheus/jmx_exporter to scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/337038 [16:06:27] 06Operations, 10netops: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#3017241 (10faidon) a:05mark>03faidon @Dzahn, change what specifically? DHCP relay? ACLs for TFTP? ACLs for webproxy, Ganglia etc.? [16:07:51] 06Operations, 10netops: Add firewall exception to get to wdqs[12]003.(codfw|eqiad).wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T157593#3017245 (10faidon) 05Open>03Resolved a:03faidon Done! 
[16:09:48] 06Operations, 10Analytics, 10Analytics-Cluster: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3017249 (10Ottomata) [16:10:06] 06Operations, 10Analytics, 10Analytics-Cluster: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3017036 (10Ottomata) [16:10:54] 06Operations, 10netops: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#3017254 (10Dzahn) @Faidon I would like the DHCP relay changed from install1001 to install1002 and from install2001 to install2002. TFTP/webproxy/Ganglia should not need changes (that... [16:11:21] hashar: wanna remove zuul-merger from scandium now? [16:11:28] (03PS2) 10Dzahn: Remove zuul-merger from scandium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/337023 (https://phabricator.wikimedia.org/T150936) (owner: 10Hashar) [16:14:05] mutante: sure [16:14:20] (03CR) 10Dzahn: [C: 032] Remove zuul-merger from scandium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/337023 (https://phabricator.wikimedia.org/T150936) (owner: 10Hashar) [16:14:27] mutante: and git-daemon not starting was some oddity with systemd etc it is all sorted out now :} [16:14:35] hashar: :) cool! [16:15:24] !log scandium - stopping zuul-merger service (T150936) [16:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:29] T150936: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936 [16:16:23] mutante: I am cleaning up the host. 
Guess then you can wipe the server and move it back to spare/decom [16:16:52] (03CR) 10Elukey: Set hue allowed_hosts=* to work around bug http://community.cloudera.com/t5/Web-UI-Hue-Beeswax/New-Cloudera-installation-Hue-Bad-Request-400 (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336906 (https://phabricator.wikimedia.org/T152714) (owner: 10Ottomata) [16:16:55] hashar: ok, i will follow-up with changes later to remove it from puppet and start the decom [16:17:30] ah i lack root on scandium bah [16:18:05] hashar: it says you are contint-roots [16:18:05] mutante: can you also dpkg --purge zuul while at it ? :) [16:18:06] PROBLEM - zuul_merger_service_running on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [16:18:20] Removing zuul (2.5.0-8-gcbc7f62-wmf4jessie1) ... [16:18:23] and Icinga will raise some alarms on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=scandium [16:18:27] yep [16:18:35] + dpkg --purge git-daemon-sysvinit [16:18:53] and we are set. I cced robh to the task for the removal of the server [16:19:23] one less server Hurrah [16:19:24] if its a decom, add hw-requests project ;] [16:19:24] scheduled long downtime [16:19:32] will remove from icinga [16:19:46] Purging configuration files for git-daemon-sysvinit (1:2.1.4-2.1+deb8u+wmf1) ... [16:19:49] done [16:19:53] robh: done https://phabricator.wikimedia.org/T150936 :-) [16:19:53] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3017347 (10hashar) [16:19:54] awesome [16:20:07] cool, i pukc up the decoms a few times a week and cycle through them [16:20:12] so will review it later today =] [16:20:15] or move it back to spare [16:20:17] pick up even [16:20:31] or whatever you need. 
One sure thing CI no more needs scandium.eqiad.wmnet and it is no more being used \O/ [16:20:37] yeah, if its in warranty it'll go to spares, if its well out of warranty it'll decom yep thanks! [16:20:58] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003/analytics-store for mlitn - https://phabricator.wikimedia.org/T157812#3017349 (10matthiasmullie) [16:22:36] PROBLEM - git_daemon_running on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user [16:23:03] ACKNOWLEDGEMENT - git_daemon_running on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user daniel_zahn https://phabricator.wikimedia.org/T150936 [16:23:05] ACKNOWLEDGEMENT - zuul_merger_service_running on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger daniel_zahn https://phabricator.wikimedia.org/T150936 [16:23:53] well I am heading back home. Have a good friday! [16:24:02] (03PS1) 10Dzahn: CI: decom scandium [puppet] - 10https://gerrit.wikimedia.org/r/337041 (https://phabricator.wikimedia.org/T150936) [16:24:05] hashar: just in time then. 
:) you too ^ [16:24:05] hashar: o/ [16:24:08] (03CR) 10jerkins-bot: [V: 04-1] CI: decom scandium [puppet] - 10https://gerrit.wikimedia.org/r/337041 (https://phabricator.wikimedia.org/T150936) (owner: 10Dzahn) [16:24:25] (03CR) 10Hashar: [C: 031] CI: decom scandium [puppet] - 10https://gerrit.wikimedia.org/r/337041 (https://phabricator.wikimedia.org/T150936) (owner: 10Dzahn) [16:24:36] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:27:14] (03PS3) 10ArielGlenn: dumps: More UI cleanup [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [16:27:21] (03CR) 10jerkins-bot: [V: 04-1] dumps: More UI cleanup [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [16:29:41] (03CR) 10Volans: "Nice! I was able to successfully run bundle exec rake lint." [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [16:32:30] (03PS4) 10ArielGlenn: dumps: More UI cleanup [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [16:32:37] (03CR) 10jerkins-bot: [V: 04-1] dumps: More UI cleanup [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [16:34:56] grrr [16:35:01] what does it not like? [16:35:13] I rebased manually, there was nothing funny about it [16:36:21] This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset. 
[16:36:30] except a manual rebase did nothing different whatsoever [16:37:31] 06Operations, 10Analytics, 10Analytics-Cluster: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3017426 (10elukey) [16:37:34] (03PS2) 10Dzahn: CI: decom scandium [puppet] - 10https://gerrit.wikimedia.org/r/337041 (https://phabricator.wikimedia.org/T150936) [16:37:39] (03CR) 10jerkins-bot: [V: 04-1] CI: decom scandium [puppet] - 10https://gerrit.wikimedia.org/r/337041 (https://phabricator.wikimedia.org/T150936) (owner: 10Dzahn) [16:37:43] ssssigh [16:37:58] looks like it ain't just me [16:39:45] Jenkins clones in a mess? [16:40:23] It seems so [16:41:43] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:41:56] i am trying to contact hashar about it [16:42:03] but failing so far [16:42:10] I think he's heading home [16:42:21] Yeah, could be commuting back from his workspace [16:42:42] he just left a bit ago: [16:42:43] 16:24 <+ hashar> leaving for now. Have a good friday [16:42:53] i know. [16:43:06] I don't mind looking at it [16:43:12] But I'm not sure where that initial thing happens [16:43:33] all I know is what I read from his email when I woke up [16:43:33] I'm guessing any slave [16:43:36] I just got online [16:43:46] we shutdown scandium [16:43:53] mutante: check T157785 [16:43:53] T157785: zuul-merger git-daemon process is not start properly by systemd ? 
- https://phabricator.wikimedia.org/T157785 [16:43:55] and now contint1001 and 2001 are zuul mergers [16:44:02] i know all that [16:44:11] 08:16 < hashar> mutante: and git-daemon not starting was some oddity with systemd etc it is all sorted out now :} [16:44:24] 06Operations, 10Analytics, 10Analytics-Cluster: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3017442 (10Ottomata) [16:44:27] 06Operations, 10Analytics, 10Analytics-Cluster: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#3017443 (10Ottomata) [16:44:27] almost I would say [16:44:34] well, all I can do is call his cell, can you do that while I finish this 1:1, mutante ? [16:44:36] 06Operations, 10Analytics, 10Analytics-Cluster: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3017036 (10Ottomata) [16:44:49] that deamon is running [16:44:55] greg-g: i have already tried that [16:45:15] I don't have a backup number nor teleportation device :/ [16:46:20] 06Operations, 10Analytics, 10Analytics-Cluster: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#2952404 (10Ottomata) @MoritzMuehlenhoff, do you a preference re 'cdh' vs 'analytics-external'? 'analytics-external' might be more future proof, but what if we need... 
[16:51:35] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3017455 (10RobH) [16:53:08] (03PS1) 10Dzahn: Revert "Remove zuul-merger from scandium.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/337047 [16:53:15] (03CR) 10jerkins-bot: [V: 04-1] Revert "Remove zuul-merger from scandium.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/337047 (owner: 10Dzahn) [17:00:48] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [17:00:55] (03CR) 10jerkins-bot: [V: 04-1] dumps: More UI cleanup [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [17:01:50] mutante: I was going to suggest reverting the change like that, but I didn't know enough if that was feasible anymore :) [17:02:40] i don't want to make it worse by doing that [17:02:50] since it's about git data [17:02:58] and yes [17:03:20] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3014434 (10RStallman-legalteam) @ellery and @Nithum: Hi there, it's Rachel from legal. We do have a current MOU on file for Nithum, but not an NDA. We had begun updating the N... [17:06:54] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017512 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2028.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic202... 
[17:10:43] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:11:08] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#3017534 (10faidon) [17:11:11] 06Operations, 10netops: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#3017532 (10faidon) 05Open>03Resolved It actually required a bunch of ACL changes for TFTP/webproxy/Ganglia. All of these plus the DHCP relay have been adjusted now across eqiad/cod... [17:11:19] Reedy: o/ [17:11:26] hey [17:11:28] mutante: ^^ [17:11:40] hashar: oh , hello :) great you came back [17:11:41] I was commuting back home [17:11:45] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3017535 (10RobH) Please note that this shell request is blocked on two items: * The NDA must be resolved with legal. We'll be looking forward to an update from @RStallman-le... [17:11:45] yeah Sam paged me [17:11:58] so since we removed scandium things are failing [17:12:03] so I bet CI claims that patches can't be merged? [17:12:03] grr [17:12:05] :( [17:12:05] but a revert seemed a risk to make it worse [17:12:09] yes [17:12:11] yes that is exactly what it claims [17:12:54] the zuul-merger are daemon that register to the Gearman server merge:update and merge:merge functions [17:13:03] and there are only two registered contint2001 and contint1001 [17:13:11] I guess the new instances are not working so well :( [17:13:15] though I did monitor them today [17:13:16] i started copying /etc/zuul from contint1001 to restore it.. and stuff.. but did not do that [17:13:43] it is all maintained by puppet [17:13:49] any example of a change failing ? 
[17:13:51] yes, the service is running too [17:14:03] hashar: https://gerrit.wikimedia.org/r/#/c/337047/1 :p [17:14:07] * hashar tail -F /var/log/zuul/merger*.log on both contin1001 and contint2001 [17:14:20] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/337047 (owner: 10Dzahn) [17:14:27] (03CR) 10jerkins-bot: [V: 04-1] Revert "Remove zuul-merger from scandium.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/337047 (owner: 10Dzahn) [17:14:32] https://gerrit.wikimedia.org/r/335684 [17:14:33] UnboundLocalError: local variable 'repo' referenced before assignment [17:14:35] here's another one [17:15:12] GitCommandError: 'git clone -v ssh://jenkins-bot@gerrit.wikimedia.org:29418/operations/puppet /srv/zuul/git/operations/puppet' returned with exit code 128 [17:15:12] stderr: 'fatal: destination path '/srv/zuul/git/operations/puppet' already exists and is not an empty directory. [17:15:17] that is on contint2001 [17:16:21] it fails because a change for operations/puppet/mariadb got merged on that host [17:16:21] eh, but wasnt this all the same on scandium [17:16:25] that created '/srv/zuul/git/operations/puppet/mariadb [17:16:36] oh [17:16:40] and when later a change for operations/puppet runs on it, zuul-merge fails to clone because the dir already exists [17:16:49] I should have remembered about that bug sorry :( [17:16:57] well, i am glad you know the bug :) [17:17:11] so that specific one is fixed ( operations/puppet on contint2001 ) [17:17:21] but I guess I will to repopulate repos [17:17:38] and of course, find out a fix for zuul-merger so that instead of git clone it does a git init / git fetch etc [17:17:42] :( [17:17:43] so did this happen because the merge was in the moment when things were switched over? [17:17:49] workaround: [17:18:12] as 'zuul' user: rm -fR /srv/zuul/git/operations/puppet ; git clone -v ssh://jenkins-bot@gerrit.wikimedia.org:29418/operations/puppet /srv/zuul/git/operations/puppet [17:18:27] alright, gtk! 
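[editor's note] The failure described above (a nested repository such as operations/puppet/mariadb is cloned first, so a later `git clone` of operations/puppet finds a non-empty destination) and the "git init / git fetch" idea hashar mentions can be reproduced locally. The following is a minimal sketch under stated assumptions, not zuul-merger's actual code: the repositories are throwaway local stand-ins, and the names `parent`, `child`, `make_origin`, and the temporary directory layout are hypothetical.

```shell
#!/bin/sh
# Sketch only: local stand-ins for operations/puppet ("parent") and
# operations/puppet/mariadb ("child"). All names here are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Create a tiny origin repository containing a single empty commit.
make_origin() {
    mkdir -p "$1"
    git init -q "$1"
    git -C "$1" -c user.name=demo -c user.email=demo@example.org \
        commit -q --allow-empty -m "initial commit"
}
make_origin "$work/origin/parent"
make_origin "$work/origin/child"

# The nested repository is cloned first, which leaves the parent
# directory in place and non-empty.
mkdir -p "$work/git/parent"
git clone -q "$work/origin/child" "$work/git/parent/child"

# A plain clone of the parent is now expected to fail, because the
# destination path already exists and is not an empty directory.
if git clone -q "$work/origin/parent" "$work/git/parent" 2>/dev/null; then
    clone_result=ok
else
    clone_result=failed
fi

# git init followed by git fetch does not require an empty directory.
git init -q "$work/git/parent"
git -C "$work/git/parent" fetch -q "$work/origin/parent" HEAD
git -C "$work/git/parent" checkout -q FETCH_HEAD   # leaves a detached HEAD

echo "clone: $clone_result"
```

Unlike the `rm -fR` workaround quoted above, which removes everything under the parent path, this approach leaves the nested clone in place. Whether that suits zuul-merger is an open question in the log itself.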
[17:18:38] mutante: it fails because zuul-merger does not properly handle repositories sharing the same namespace :( [17:18:52] aha [17:19:03] I guess openstack never ran into it because their repos are usuall all namespace/project [17:19:07] when we have a much more complicated hierarchy [17:19:15] so yeah definitely want to fix that one for good [17:19:40] sorry about that :( [17:19:48] 06Operations, 10netops: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#3017550 (10Dzahn) Thank you very much! [17:20:00] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/337047 (owner: 10Dzahn) [17:21:44] * hashar looks for the related phabricator task [17:24:45] !bug 1 | hashar [17:24:45] hashar: https://bugzilla.wikimedia.org/show_bug.cgi?id=1 [17:25:48] another way to fix https://phabricator.wikimedia.org/T138455#2401076 [17:32:19] Reedy: mutante and refilled it as https://phabricator.wikimedia.org/T157818 [17:32:35] with info about how to fix it up. Will try to figure out a proper fix tonight [17:32:58] hashar: ah! very nice. thank you! [17:33:03] I can't remember off hand how zuul init a repo. I think we had the same issue with zuul-cloner [17:33:05] and changed to use git init [17:33:39] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: elastic2028 fails to reimage - root device not found - https://phabricator.wikimedia.org/T157819#3017609 (10Gehel) [17:38:08] (03CR) 10RobH: [C: 032] Sam Tarling shell access + statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/336875 (owner: 10RobH) [17:38:14] (03PS3) 10RobH: Sam Tarling shell access + statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/336875 [17:42:04] man, are you sure someone is just playing games with people's names? 
[17:42:09] robh: ^ :) [17:42:43] yeah i think he knows hes ging to be confused as tim a lottttt [17:43:27] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3017658 (10RobH) 05stalled>03Resolved a:05RobH>03None Ok, access has been merged live. It'll take ~30 minutes for the user addition to filter though all... [17:44:12] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3017663 (10Nithum) @RobH: @ellery suggested analytics-privatedata-users would be the correct group above. When I receive the NDA, I'll pass it on to the Jigsaw team, but we... [17:45:51] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: elastic2028 fails to reimage - root device not found - https://phabricator.wikimedia.org/T157819#3017664 (10Gehel) 05Open>03Resolved a:03Gehel Fixed by adding manually adding `rootdelay=90` to the grub command line on first boot. (Thanks... [17:46:25] (03CR) 10ArielGlenn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [17:46:31] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3017672 (10demon) >>! In T118154#3017229, @ArielGlenn wrote: > - cirrussearch 3.1T (@demon, do you know who uses these pimarily and what the demand is like?) Nope, not my data... 
[17:47:37] (03PS5) 10ArielGlenn: dumps: More UI cleanup [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [17:49:28] (03CR) 10ArielGlenn: [C: 032] dumps: More UI cleanup [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [17:49:45] (03CR) 10Hashar: [V: 031 C: 031] "> bundle exec rake lint" [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [17:56:28] (03PS1) 10RobH: add Matthias Mullie to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/337055 [17:57:03] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic20(27|28).codfw.wmnet [17:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:25] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003/analytics-store for mlitn - https://phabricator.wikimedia.org/T157812#3017692 (10RobH) 05Open>03stalled This one is pretty easy, but I'll outline the steps for the sake of clarity: [x] - user has signed the L3 documen... [18:03:49] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3017704 (10RobH) I'll prepare the patchset, thanks! (Once legal confirms on this task pending all the nda stuff, we should be good.) [18:06:29] 06Operations, 06Analytics-Kanban, 13Patch-For-Review, 03Scap3: Package + deploy new version of git-fat - https://phabricator.wikimedia.org/T155856#3017715 (10thcipriani) 05Open>03Resolved Thanks @Ottomata! [18:08:55] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3017726 (10EBernhardson) >>! In T118154#3017672, @demon wrote: >>>! In T118154#3017229, @ArielGlenn wrote: >> - cirrussearch 3.1T (@demon, do you know who uses these pimarily an... 
[18:08:56] !log renabling delayed replication for dbstore2001 T130128 [18:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:01] T130128: Fix dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T130128 [18:09:53] 06Operations, 10scap, 03Scap3: Trying to scap while l10nupdate is syncing shows unhelpful error - https://phabricator.wikimedia.org/T153278#3017743 (10thcipriani) 05Open>03Resolved a:03thcipriani Removed the backtrace from this error, output should be cleaner and should only show the `LockFailedError`... [18:19:58] (03PS1) 10Gehel: elasticsearch - reimage elastic20(29|30|31|32) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/337060 (https://phabricator.wikimedia.org/T151326) [18:22:13] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 03Scap3, 15User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3017802 (10mmodell) [18:22:28] 06Operations, 06Release-Engineering-Team (Long-Lived-Branches), 03Scap3: Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#3017807 (10mmodell) [18:22:41] (03CR) 10Volans: "Thanks for the details Antoine." 
[puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [18:22:59] 06Operations, 10Icinga, 03Scap3: expose hosts in maintenance state so we can prevent scap from running on them - https://phabricator.wikimedia.org/T100777#3017822 (10mmodell) [18:23:08] 06Operations, 13Patch-For-Review, 03Scap3: Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#3017826 (10mmodell) [18:23:37] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic20(29|30|31|32).codfw.wmnet [18:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:42] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic20(29|30|31|32) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/337060 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [18:27:37] !log brion running throttled version of requeueTranscodes.php for low-res transcodes. expect increased load on video scalers but should remain responsive. [18:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:33] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017852 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2029.codfw.wmnet'] ```... [18:30:39] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017853 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2030.codfw.wmnet'] ```... 
[18:31:16] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017854 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2031.codfw.wmnet'] ```... [18:31:24] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017855 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2032.codfw.wmnet'] ```... [18:31:26] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 10Scap, 15User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3017856 (10mmodell) 05Open>03Resolved a:03mmodell [18:36:52] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017865 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2032.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic203... [18:37:04] (03CR) 10BryanDavis: "Maybe split this into two separate patches, one that introduces the /etc/toollabs-cronhost file that is a simple merge and then one that r" [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [18:42:57] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3017868 (10ArielGlenn) @Ebernhardson, can you get that file via http for processing or do you need it to be on something that pretends to be a local filesystem? 
[18:44:26] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017869 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2032.codfw.wmnet'] ```... [18:46:32] (03CR) 10Zhuyifei1999: "alright" [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [18:46:40] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:48:20] RECOVERY - Host mw1236 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:51:06] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3017891 (10EBernhardson) I think we've only ever grabbed it via http [18:51:40] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:56:18] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017898 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2029.codfw.wmnet'] ``` and were **ALL** successful. [19:00:38] (03PS3) 10Zhuyifei1999: toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) [19:01:00] 06Operations, 10ops-eqiad: mw1236 powered down and not able to powerup - https://phabricator.wikimedia.org/T156610#3017902 (10Cmjohnson) @elukey Both PSU's must have taken a spike because they both were off. I had to reseat the PSU"s and drain any flea power. Once I did this plugged the psu back into the serv... 
[19:02:20] (03PS4) 10Zhuyifei1999: toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) [19:03:55] (03PS5) 10Zhuyifei1999: toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) [19:10:02] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3017922 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2032.codfw.wmnet'] ``` and were **ALL** successful. [19:13:25] (03PS1) 10Jdlrobson: Disable Hungarian Popups A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337064 (https://phabricator.wikimedia.org/T156290) [19:18:11] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:19:59] (03PS3) 10Faidon Liambotis: aptrepo: add suite stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/336386 [19:20:01] (03PS1) 10Faidon Liambotis: autoinstall: add stretch [puppet] - 10https://gerrit.wikimedia.org/r/337065 [19:28:39] (03PS1) 10Faidon Liambotis: autoinstall: switch d-i-test to stretch [puppet] - 10https://gerrit.wikimedia.org/r/337067 [19:28:54] (03CR) 10Faidon Liambotis: [C: 032] aptrepo: add suite stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/336386 (owner: 10Faidon Liambotis) [19:29:03] (03CR) 10Faidon Liambotis: [C: 032] autoinstall: add stretch [puppet] - 10https://gerrit.wikimedia.org/r/337065 (owner: 10Faidon Liambotis) [19:29:15] (03CR) 10Faidon Liambotis: [V: 032 C: 032] autoinstall: switch d-i-test to stretch [puppet] - 10https://gerrit.wikimedia.org/r/337067 (owner: 10Faidon Liambotis) [19:32:46] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:36:21] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3018033 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2030.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic203... [19:43:14] (03PS1) 10Cmjohnson: Adding elastic1048-1052 to dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/337071 [19:46:44] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3018054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2031.codfw.wmnet'] ``` and were **ALL** successful. [19:47:06] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [19:51:18] (03CR) 10Cmjohnson: [C: 032] Adding elastic1048-1052 to dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/337071 (owner: 10Cmjohnson) [19:51:27] (03PS2) 10Cmjohnson: Adding elastic1048-1052 to dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/337071 [19:51:33] (03CR) 10Cmjohnson: [V: 032 C: 032] Adding elastic1048-1052 to dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/337071 (owner: 10Cmjohnson) [19:52:32] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3018066 (10Cmjohnson) [19:53:39] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#2954751 (10Cmjohnson) dhcpd file updated for a jessie install but will leave that for @gehel [19:53:55] (03CR) 10Krinkle: "expanddblist has existed for a while and is a regular part of the toolkit used by people with deployment access. 
Any SWAT member should be" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334462 (owner: 10Krinkle) [19:54:06] (03PS4) 10Krinkle: Don't use computed dblist in production (nowikidatadescriptiontaglines) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334462 [19:54:25] (03CR) 10Krinkle: "I'll roll out this no-op sometimes on Monday - after verifying it is still up to date." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334462 (owner: 10Krinkle) [19:55:16] PROBLEM - puppet last run on mc1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:56:46] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:11:37] !log silence graphite1001 for ssd reinstall - T157022 [20:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:43] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [20:14:48] (03Abandoned) 10Dzahn: Revert "Remove zuul-merger from scandium.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/337047 (owner: 10Dzahn) [20:16:27] (03PS1) 10Faidon Liambotis: autoinstall: add virtual.cfg to d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/337074 [20:16:29] (03PS1) 10Faidon Liambotis: autoinstall: pass net.ifnames=0 to stretch d-i [puppet] - 10https://gerrit.wikimedia.org/r/337075 [20:17:14] (03CR) 10Faidon Liambotis: [V: 032 C: 032] autoinstall: add virtual.cfg to d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/337074 (owner: 10Faidon Liambotis) [20:17:54] (03PS2) 10Faidon Liambotis: autoinstall: add virtual.cfg to d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/337074 [20:17:56] (03PS2) 10Faidon Liambotis: autoinstall: pass net.ifnames=0 to stretch d-i [puppet] - 10https://gerrit.wikimedia.org/r/337075 [20:18:14] (03CR) 10Faidon Liambotis: [V: 032 C: 032] autoinstall: pass net.ifnames=0 to stretch d-i [puppet] - 
10https://gerrit.wikimedia.org/r/337075 (owner: 10Faidon Liambotis) [20:18:45] (03CR) 10Faidon Liambotis: [V: 032 C: 032] autoinstall: add virtual.cfg to d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/337074 (owner: 10Faidon Liambotis) [20:19:53] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:20:53] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:22:20] Hey, Can I deploy this today? https://gerrit.wikimedia.org/r/337076 Special:Nuke is completely broken and we don't have SWAT window today [20:23:02] greg-g: Also pinging you regarding this ^ [20:23:13] RECOVERY - puppet last run on mc1018 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:26:51] (03PS1) 10Filippo Giunchedi: install_server: reinstall graphite1001 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/337077 (https://phabricator.wikimedia.org/T157022) [20:29:12] (03PS1) 10Krinkle: phpunit: Add test to verify computed lists are up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337078 [20:29:24] (03CR) 10Filippo Giunchedi: [C: 032] install_server: reinstall graphite1001 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/337077 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [20:36:46] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#3018185 (10Cmjohnson) all 4 disks have been swapped. 
The server is on and accessible via mgmt [20:37:17] (03CR) 10Volans: [C: 032] tests: Use sample data that doesn't match production names [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 (owner: 10Krinkle) [20:38:02] (03Merged) 10jenkins-bot: tests: Use sample data that doesn't match production names [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 (owner: 10Krinkle) [20:38:33] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic20(29|30|31|32).codfw.wmnet [20:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:17] Amir1: go ahead [20:43:37] Thanks! [20:43:38] (I went through 2 versions of "no" before getting there, fwiw) [20:44:38] greg-g: why? If it wasn't Friday, I would've have waited for SWAT [20:45:08] Because it has been broken for over a week and a half already, getting it out today doens't seem critical [20:45:15] and those 3 tasks are all untriaged [20:45:22] so no one has decided they were UBN! [20:45:40] but, your change looks fairly safe, so to unbreak it, I'm OK with it [20:45:43] (03PS1) 10Ottomata: Include geoip on refinery hosts [puppet] - 10https://gerrit.wikimedia.org/r/337081 [20:46:15] (03Abandoned) 10Ottomata: Include geoip on refinery hosts [puppet] - 10https://gerrit.wikimedia.org/r/337081 (owner: 10Ottomata) [20:46:40] Thanks, this reasoning helps me [20:46:44] (03PS1) 10Ottomata: Include geoip on refinery hosts [puppet] - 10https://gerrit.wikimedia.org/r/337082 [20:47:09] anytime :) [20:47:19] I'm always willing to explain reasoning :) [20:47:37] I have a kid, I'm used to at least 5 levels depth of "why?" 
[20:47:51] (2 actually, but only one that talks) [20:48:33] greg-g: https://www.youtube.com/watch?v=4u2ZsoYWwJA This is one of my absolute favorite stand-up comedies [20:49:06] Highly recommended, it's about reasoning to children [20:49:11] :) [20:50:35] (03PS3) 10Dzahn: CI: decom scandium [puppet] - 10https://gerrit.wikimedia.org/r/337041 (https://phabricator.wikimedia.org/T150936) [20:51:51] (03CR) 10Krinkle: [C: 032] phpunit: Add test to verify computed lists are up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337078 (owner: 10Krinkle) [20:53:28] (03Merged) 10jenkins-bot: phpunit: Add test to verify computed lists are up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337078 (owner: 10Krinkle) [20:53:41] (03CR) 10jenkins-bot: phpunit: Add test to verify computed lists are up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337078 (owner: 10Krinkle) [20:54:08] (03CR) 10Ottomata: [C: 032] Include geoip on refinery hosts [puppet] - 10https://gerrit.wikimedia.org/r/337082 (owner: 10Ottomata) [20:54:21] (03PS2) 10Dzahn: install/TFTP: use install1002 and install2002 as next-servers [puppet] - 10https://gerrit.wikimedia.org/r/336959 [20:55:01] (03PS1) 10Dzahn: remove install1001/install2001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/337084 [20:56:47] (03PS5) 10Krinkle: Don't use computed dblist in production (nowikidatadescriptiontaglines) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334462 [20:58:03] (03PS3) 10Dzahn: install/TFTP: use install1002 and install2002 as next-servers [puppet] - 10https://gerrit.wikimedia.org/r/336959 (https://phabricator.wikimedia.org/T84380) [20:58:23] (03PS2) 10Dzahn: remove install1001/install2001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/337084 (https://phabricator.wikimedia.org/T84380) [20:59:17] Confirmed it's working on mwdebug1002 [20:59:41] [= [21:00:12] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Puppet changes required 
for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#3018234 (10EBernhardson) AFAICT the settings update has finished on all the indices in eqiad and codfw clusters with no errors. [21:00:59] (03PS4) 10Dzahn: install/TFTP: use install1002 and install2002 as next-servers [puppet] - 10https://gerrit.wikimedia.org/r/336959 (https://phabricator.wikimedia.org/T84380) [21:01:16] (03CR) 10Dzahn: [C: 032] install/TFTP: use install1002 and install2002 as next-servers [puppet] - 10https://gerrit.wikimedia.org/r/336959 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [21:02:10] (03PS5) 10Dzahn: install/DHCP/TFTP: use install1002 and install2002 as next-servers [puppet] - 10https://gerrit.wikimedia.org/r/336959 (https://phabricator.wikimedia.org/T84380) [21:02:20] (03CR) 10Dzahn: [V: 032 C: 032] install/DHCP/TFTP: use install1002 and install2002 as next-servers [puppet] - 10https://gerrit.wikimedia.org/r/336959 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [21:02:58] !log ladsgroup@tin:/srv/mediawiki-staging$ scap sync-file php-1.29.0-wmf.11/extensions/Nuke/Nuke_body.php '[[gerrit:337076]] Fixing Special:Nuke (T156112, T156949, T156314)' [21:03:00] !log ladsgroup@tin Synchronized php-1.29.0-wmf.11/extensions/Nuke/Nuke_body.php: [[gerrit:337076]] Fixing Special:Nuke (T156112, T156949, T156314) (duration: 00m 58s) [21:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:05] T156314: Special:Nuke not selecting based on filters, instead last 500 new files - https://phabricator.wikimedia.org/T156314 [21:03:05] T156112: Mass delete only works with default values on Special:Nuke - https://phabricator.wikimedia.org/T156112 [21:03:05] T156949: Filtering on Username on Special:Nuke does not work - https://phabricator.wikimedia.org/T156949 [21:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:40] (03CR) 10Dzahn: [C: 032] remove install1001/install2001 from site.pp 
[puppet] - 10https://gerrit.wikimedia.org/r/337084 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [21:04:48] Everything looks fine [21:05:08] (03PS1) 10Faidon Liambotis: aptrepo: add new RSA 4096 apt key [puppet] - 10https://gerrit.wikimedia.org/r/337089 [21:05:42] * addshore would love to deploy something too ;) just a style fix though, everything still functions) https://gerrit.wikimedia.org/r/#/c/337030/ ;) << greg-g Amir1 [21:05:51] (03CR) 10Faidon Liambotis: [V: 032 C: 032] aptrepo: add new RSA 4096 apt key [puppet] - 10https://gerrit.wikimedia.org/r/337089 (owner: 10Faidon Liambotis) [21:06:04] but im sure that can probably wait :P [21:07:24] (03PS3) 10Dzahn: remove install1001/install2001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/337084 (https://phabricator.wikimedia.org/T84380) [21:07:51] (03CR) 10Dzahn: [V: 032 C: 032] remove install1001/install2001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/337084 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [21:08:01] (03PS4) 10Dzahn: remove install1001/install2001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/337084 (https://phabricator.wikimedia.org/T84380) [21:15:21] (03PS1) 10Dzahn: remove install1001 and install2001, keep 2001 mgmt [dns] - 10https://gerrit.wikimedia.org/r/337093 (https://phabricator.wikimedia.org/T84380) [21:15:29] (03CR) 10jerkins-bot: [V: 04-1] remove install1001 and install2001, keep 2001 mgmt [dns] - 10https://gerrit.wikimedia.org/r/337093 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [21:18:27] !log install1001, install2001 - revoke puppet certs, puppet node deactivate, delete salt keys (T84380, T132757) [21:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:34] T84380: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380 [21:18:34] T132757: Split carbon's install/mirror roles, provision install1001 - 
https://phabricator.wikimedia.org/T132757 [21:18:57] 06Operations: re-create install2001 as a VM - https://phabricator.wikimedia.org/T156440#3018314 (10Dzahn) [21:19:25] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#3018317 (10Dzahn) [21:19:27] 06Operations: re-create install2001 as a VM - https://phabricator.wikimedia.org/T156440#2974819 (10Dzahn) 05Open>03Resolved This has happened. install2002 replaced install2001. install2001 will be decom'ed. [21:24:14] 06Operations, 10hardware-requests: spare ex4200s - check on quantity for potential shipment to OIT - https://phabricator.wikimedia.org/T157839#3018327 (10RobH) [21:26:19] f [21:27:47] !log install1001, install2001 - removed from Icinga, shutting down (T84380, T132757) [21:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:52] T84380: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380 [21:27:52] T132757: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757 [21:29:50] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:31:16] 06Operations, 06DC-Ops: decom install2001 - https://phabricator.wikimedia.org/T157840#3018351 (10Dzahn) [21:33:32] 06Operations, 06DC-Ops: decom install2001 - https://phabricator.wikimedia.org/T157840#3018375 (10Dzahn) removed from puppet https://gerrit.wikimedia.org/r/#/c/337084/ removed from install https://gerrit.wikimedia.org/r/#/c/336959/ ganglia switched https://gerrit.wikimedia.org/r/#/c/336362/ etc.. 21:27 mut... 
[21:34:17] (03PS2) 10Dzahn: remove install1001 and install2001, keep 2001 mgmt [dns] - 10https://gerrit.wikimedia.org/r/337093 (https://phabricator.wikimedia.org/T84380) [21:34:28] (03CR) 10jerkins-bot: [V: 04-1] remove install1001 and install2001, keep 2001 mgmt [dns] - 10https://gerrit.wikimedia.org/r/337093 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [21:36:10] (03PS3) 10Dzahn: remove install1001 and install2001, keep 2001 mgmt [dns] - 10https://gerrit.wikimedia.org/r/337093 (https://phabricator.wikimedia.org/T84380) [21:46:52] (03PS1) 10Faidon Liambotis: Replace 'zsh-beta' with 'zsh' [puppet] - 10https://gerrit.wikimedia.org/r/337153 [21:46:54] (03PS1) 10Faidon Liambotis: autoinstall: also pass net.ifnames=0 to the end system [puppet] - 10https://gerrit.wikimedia.org/r/337154 [21:48:51] (03CR) 10Faidon Liambotis: [C: 032] Replace 'zsh-beta' with 'zsh' [puppet] - 10https://gerrit.wikimedia.org/r/337153 (owner: 10Faidon Liambotis) [21:49:02] (03CR) 10Faidon Liambotis: [C: 032] autoinstall: also pass net.ifnames=0 to the end system [puppet] - 10https://gerrit.wikimedia.org/r/337154 (owner: 10Faidon Liambotis) [21:50:30] (03PS1) 10Nuria: [WIP] Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [21:51:21] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [21:52:43] brion, https://phabricator.wikimedia.org/T157028 I tried to sort out the failed transcode by exitcode, is it useful? [21:52:55] should I continue? [21:53:24] I thought it might help to find a pattern [21:54:09] yannf: can't hurt, at least helps distinguish between the resources limits and other problems [21:54:12] Thanks! 
[21:54:36] I'm currently filling in missing transcodes [21:54:55] ok [21:54:57] Will tackle more of the remaining errors after that [21:55:26] The new queue change seems to help with responsiveness on new uploads :) [21:56:03] yes, sure [21:56:08] that's great [21:57:38] yannf brion: fwiw https://commons.wikimedia.org/wiki/User:Dispenser/Transcode_errors [21:57:55] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:58:10] Nice [21:58:29] zhuyifei1999_, it needs to be updated [21:58:50] poke Dispenser [21:58:51] that was before Dispenser put back every failed transcodes into the queue [21:59:05] PROBLEM - configured eth on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:59:15] PROBLEM - dhclient process on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:59:16] zhuyifei1999_, he is not here [21:59:35] PROBLEM - puppet last run on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:59:40] he might be later [21:59:45] PROBLEM - salt-minion processes on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:59:58] or you can leave a message on talk page or whatever [22:01:35] PROBLEM - Check systemd state on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:01:45] PROBLEM - DPKG on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:01:55] PROBLEM - Disk space on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[22:02:12] ignore those [22:02:17] d-i-test is clearly test :) [22:03:04] (03CR) 10Krinkle: [WIP] Changes to perf consumer of event logging events (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [22:04:58] (03CR) 10Krinkle: [WIP] Changes to perf consumer of event logging events (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [22:08:42] (03PS1) 10Andrew Bogott: Toollabs: Remove zsh from package list [puppet] - 10https://gerrit.wikimedia.org/r/337184 [22:08:54] (03PS1) 10Faidon Liambotis: salt: add missing import to grain-ensure.py [puppet] - 10https://gerrit.wikimedia.org/r/337185 [22:09:41] paravoid: fyi, ^^ is a followup to your recent patch [22:10:27] andrewbogott: oh, oops [22:10:42] (03CR) 10Faidon Liambotis: [C: 032] Toollabs: Remove zsh from package list [puppet] - 10https://gerrit.wikimedia.org/r/337184 (owner: 10Andrew Bogott) [22:11:43] (03PS2) 10Faidon Liambotis: salt: add missing import to grain-ensure.py [puppet] - 10https://gerrit.wikimedia.org/r/337185 [22:11:48] (03CR) 10Dzahn: [C: 032] remove install1001 and install2001, keep 2001 mgmt [dns] - 10https://gerrit.wikimedia.org/r/337093 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [22:11:59] (03CR) 10Faidon Liambotis: [V: 032 C: 032] salt: add missing import to grain-ensure.py [puppet] - 10https://gerrit.wikimedia.org/r/337185 (owner: 10Faidon Liambotis) [22:13:23] zhuyifei1999_, Dispenser said on IRC that he will update this list when the current queue is empty [22:14:11] it's quite sensible, otherwise it would have to be done again after a few days [22:17:16] !log install1001 - shutdown ganeti instance and deleting it and its disk (T132757) [22:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:21] T132757: Split carbon's install/mirror roles, provision install1001 - 
https://phabricator.wikimedia.org/T132757 [22:18:52] (03CR) 10Hashar: [V: 031 C: 031] "There are indeed 463 ruby files (find . -name '*.rb'|wc -l)" [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [22:20:32] 06Operations, 06DC-Ops, 13Patch-For-Review: decom install2001 - https://phabricator.wikimedia.org/T157840#3018449 (10Dzahn) [22:20:57] 06Operations, 06DC-Ops, 13Patch-For-Review: decom install2001 - https://phabricator.wikimedia.org/T157840#3018351 (10Dzahn) a:05Dzahn>03None [22:21:13] 06Operations, 06DC-Ops: decom install2001 - https://phabricator.wikimedia.org/T157840#3018351 (10Dzahn) [22:21:31] 06Operations, 10ops-codfw, 06DC-Ops: decom install2001 - https://phabricator.wikimedia.org/T157840#3018351 (10Dzahn) [22:23:03] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [22:23:36] 06Operations, 13Patch-For-Review: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380#3018459 (10Dzahn) this is done for: TFTP DHCP webproxy just APT will be pointing to just eqiad for the moment [22:24:11] 06Operations: Setup basic infrastructure services in codfw - https://phabricator.wikimedia.org/T84350#3018461 (10Dzahn) [22:24:14] 06Operations, 13Patch-For-Review: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380#3018460 (10Dzahn) 05stalled>03Open [22:24:58] 06Operations, 13Patch-For-Review: Setup install server in codfw - tftp done, but not apt and other install services (now: DHCP, TFTP, webproxy done, just not APT) - https://phabricator.wikimedia.org/T84380#926563 (10Dzahn) [22:27:19] 06Operations, 10ops-codfw, 06DC-Ops, 10hardware-requests: decom install2001 - https://phabricator.wikimedia.org/T157840#3018463 (10RobH) [22:29:17] !log carbon - stopping puppet and most 
services, adding deprecation warning to motd, rsyncing data one last time (T132757) [22:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:22] T132757: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757 [22:35:03] RECOVERY - DPKG on graphite1001 is OK: All packages OK [22:35:23] RECOVERY - MD RAID on graphite1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [22:36:33] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [22:36:37] that's me ^ [22:36:54] 06Operations, 10ops-codfw, 06DC-Ops, 10hardware-requests: decom install2001 - https://phabricator.wikimedia.org/T157840#3018488 (10RobH) a:03Papaul [22:38:23] PROBLEM - salt-minion processes on carbon is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:42:36] !log start rsync of whisper metrics graphite2001 -> graphite1001 - T157022 [22:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:41] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [22:44:58] gerrit.wikimedia.org is loading very slowly for me. [22:45:18] mutante or any RainbowSprinkles ^^ [22:45:30] This site can't be reached [22:45:30] gerrit.wikimedia.org took too long to respond. 
[22:46:23] RECOVERY - salt-minion processes on carbon is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:46:29] gerrit.wikimedia.org is down [22:46:32] according to http://www.isitdownrightnow.com/gerrit.wikimedia.org.html [22:48:03] paladox: back [22:48:12] it was one of those 2 minute outages that heal themselves [22:48:13] gerrit's down for me [22:48:16] oh [22:48:24] works now [22:48:42] Quick for me [22:48:51] Random who knows what [22:48:57] It's a friday, of course gerrit would do that [22:49:00] yea, same effect we saw before [22:49:37] oh [22:50:28] and it's down again [22:50:31] mutante ^^ [22:50:39] 06Operations, 10ops-esams, 10hardware-requests: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#3018549 (10RobH) disabled the network ports for the powered off systems cp3011-3022 robh@csw2-esams# show | compare [edit interfaces xe-5/0/0] + disable; [edit interfaces xe-5/0/1] +... [22:50:45] RainbowSprinkles ^^ [22:51:01] loads again. Hmm, goes down every 2 mins. [22:51:10] when that happens it's just the stirrings of the gerrit singularity. [22:51:43] when it becomes fully conscious it will either cherish or destroy RainbowSprinkles for being its caretaker until now [22:51:49] oh, and it stops loading again. I don't think it's my internet, as I can load other websites. [22:52:01] (03PS1) 10Faidon Liambotis: salt: use SHA256 master key fingeprint on newer systems [puppet] - 10https://gerrit.wikimedia.org/r/337189 [22:52:32] Yep not my internet [22:52:51] I am trying on my mobile phone provider and it also takes a long time to load it on 4g. [22:54:15] paladox: I can load gerrit well... [22:54:22] yep works again [22:54:33] but it seems that it does it every 2 mins and then stops. [22:54:47] paladox: ticket on phab? 
[22:54:52] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018553 (10greg) [22:54:53] can triage as ub now or something [22:54:57] greg-g: nice [22:54:58] robh: Btw, I run gerrit from my laptop, cobalt is just a facade. Sssh don't tell anyone ;-) [22:55:01] Yep i will do that now. I will do high. [22:56:03] paladox: done [22:56:06] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018555 (10JustBerry) p:05Normal>03Unbreak! @greg Important enough. [22:56:08] (03CR) 10jerkins-bot: [V: 04-1] salt: use SHA256 master key fingeprint on newer systems [puppet] - 10https://gerrit.wikimedia.org/r/337189 (owner: 10Faidon Liambotis) [22:56:13] oh thanks. [22:56:14] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018558 (10Paladox) p:05Unbreak!>03High Happened again. This time it is different as we have switched gc off. But cpu looks high. Happened every 2 minutes in the last 20... [22:56:30] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018560 (10Paladox) p:05High>03Unbreak! [22:56:34] paladox: ?? [22:56:40] Conflicts [22:56:58] i was writing a comment when you published so it was a conflict. [22:57:51] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018561 (10JustBerry) @Andrew @daniel mentioned similar issues over IRC not long ago. [22:57:57] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018564 (10Paladox) [22:58:07] paladox: why is that task now UBN!? [22:58:30] greg-g I did not set it as UBN, i just put it back to what JustBerry set it as. 
[22:58:43] greg-g: I'll remove it [22:58:45] But i was going to set it as high because it happened again tonight. [22:58:50] oh, yeah, it was JustBerry [22:58:56] greg-g: I put it that way [22:59:02] it's been affecting at least 3 other users [22:59:05] "it happening again" doesn't mean "UBN!" by default [22:59:07] andrew, etc. [22:59:14] We know [22:59:14] greg-g: well... [22:59:20] so what shall I do ;p [22:59:35] we're working on it, see that task, there are open patches [22:59:48] greg-g: set it to high or back to normal? [22:59:55] hey mafk [23:00:18] PROBLEM - Check systemd state on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:00:23] JustBerry: it's between normal and high, I'd lean normal, I don't care enough/it doesn't actually effect anything [23:00:28] PROBLEM - DPKG on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:00:44] hi [23:00:48] PROBLEM - Disk space on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:00:55] is gerrit down? [23:01:28] PROBLEM - configured eth on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:01:33] greg-g: is maintenance being done on gerrit or something? It's down for me :) [23:01:38] PROBLEM - dhclient process on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:01:38] mafk no [23:01:47] ERR_TIMED_OUT [23:01:48] PROBLEM - puppet last run on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:01:49] it's down for me. [23:02:08] PROBLEM - salt-minion processes on d-i-test is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:02:14] known issue then? [23:02:22] I have just reported it. [23:02:34] ktnx [23:03:03] hey all! gerrit is unusably slow. is that what you have been talking about [23:03:04] ? 
[23:03:09] it was loading for me a moment ago, now it is not [23:03:11] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018571 (10JustBerry) Task seems important in that it is affecting users' (such as andrew's) abilities to upload critical patches, such as patches relevant to the issues highl... [23:03:27] DanielK_WMDE: see above ^^ [23:03:28] paladox: can you CC me [23:03:30] ? [23:03:30] greg-g: it has been slow for at least half an hour, seems to be getting worse [23:03:31] Ok [23:03:39] mafk: on the ticket? [23:03:40] new phab design fwiw it seems [23:03:41] DanielK_WMDE: yep [23:03:43] yup [23:03:44] most stuff gets through eventually [23:03:49] mafk you mean to https://phabricator.wikimedia.org/T148478? [23:04:30] that one it seems [23:04:34] the current gerrit one [23:04:40] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018573 (10JustBerry) ``` 18:03 DanielK_WMDE: greg-g: it has been slow for at least half an hour, seems to be getting worse ... 18:03 greg-g: DanielK_WMDE: yep 18:03 mafk: yup... [23:05:35] What's your username on phab? [23:05:37] mafk ^^ [23:06:08] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [23:06:17] hmm ^^ [23:06:18] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:06:26] paladox: done ^^ [23:06:29] is that normal ^^ [23:06:29] paladox: I'll add myself :) [23:06:32] I did it [23:06:32] ok [23:06:34] mafk: ;p [23:06:37] JustBerry added a subscriber: MarcoAurelio. [23:06:38] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:06:38] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:06:38] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:06:38] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:06:39] nope i mean [23:06:40] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [23:06:41] wew [23:06:47] Hmm [23:06:49] paladox: let's move from operations [23:06:50] thats the gerrit server [23:07:08] RainbowSprinkles ^^ [23:07:08] PROBLEM - Unmerged changes on repository puppet on labtestcontrol2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:07:08] PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:07:17] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#2731899 (10JustBerry) ``` 18:06 icinga-wm: PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 18:06 ic... [23:07:25] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018577 (10daniel) Some observations: pushing takes about a minute, but gets through eventually. Auto-compelte for reviewers is broken in the UI (times out, i guess). Everyth... [23:07:48] DanielK_WMDE: thanks for +ing that [23:08:01] np [23:08:08] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set [23:08:13] DanielK_WMDE: can I perform a test upload on a project? or create a new one? 
[23:08:18] don't have any changes to upload atm ;p [23:08:29] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018579 (10Paladox) also this just popped up PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: ERROR ferm input dr... [23:08:47] cpu all over the place on cobalt [23:09:00] JustBerry: feel free to mess with https://gerrit.wikimedia.org/r/337011 [23:09:42] JustBerry: are you just guessing at what to look for? Chad is currently investigating the issue [23:09:58] greg-g: aight. [23:09:58] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_statistics_mediawiki] [23:10:38] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [23:10:38] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [23:10:38] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge. [23:10:38] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. [23:10:58] RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge. [23:10:58] RECOVERY - Unmerged changes on repository puppet on labtestcontrol2001 is OK: No changes to merge. [23:11:01] JustBerry: thanks though :) [23:11:08] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. 
[23:11:28] greg-g: that seems gerrit related^^ ;p icinga-wm that is [23:11:55] yeah, you mentioned that on the task [23:12:06] greg-g: different series but yeah [23:12:23] !log gerrit: restarting service [23:12:25] * addshore waves goodbye to gerrit [23:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:50] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018587 (10JustBerry) Quick update, if I may: ``` 18:10 icinga-wm: RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. 18:10 icinga-wm: RE... [23:13:07] ...this ticket is quietly killing me with the resulting IRC pings... [23:13:32] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018588 (10demon) Yes, that's a cascading issue. We routinely get puppet failures when Gerrit/Git is down [23:13:43] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slowdowns - https://phabricator.wikimedia.org/T148478#3018589 (10Paladox) Those most likely are using gerrit to clone repo's. Which means if gerrit goes down then puppet fails on those hosts as they will be unable to clone. [23:13:46] slowdown: :pp hah [23:13:48] slowdown: change slowdowns to slow-downs ? [23:13:58] Good call :) [23:14:10] Change your nick? ;-) [23:14:18] don't pick a common technical use word as an IRC nick? [23:14:21] :) [23:14:27] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3018590 (10Paladox) [23:14:32] :D [23:14:47] Heh, I don't usually run into so many sequential pings as a consequence of the nick :D [23:14:50] * Gerrit gets all the pings [23:15:00] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:15:04] * mafk pkill Gerrit [23:15:07] /nick PROBLEM [23:15:12] mutante: hah! [23:15:15] So, turning it off and on again fixed it [23:15:32] :) [23:15:45] not loading for me. [23:15:50] at least we still have that trick in our bag [23:15:50] me neither [23:15:59] Still looking down for me [23:16:05] re-restart? [23:16:20] (still down for me as well fwiw) [23:16:23] Have you tried turning it off and on and off and on again? [23:16:43] Java, the Windows of programming languages [23:16:53] CPU still pegged after restart [23:17:01] hug of death [23:17:04] Reedy, you just made me spit water all over my laptop... [23:17:29] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cobalt.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Miscellaneous+eqiad [23:17:34] It's spending a ton of time in wait [23:17:43] wait times. [23:17:48] Plus the obvious spikes [23:18:16] lots of wait -> i/o issue? [23:19:37] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3018597 (10JustBerry) [23:19:42] loads for me again :) [23:19:48] Same here! ;) [23:19:49] (03PS2) 10Dzahn: add prometheus1003/1004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/336354 (https://phabricator.wikimedia.org/T152504) [23:19:54] ^ i can use it [23:20:38] DanielK_WMDE: I'm not sure. Jvm debugging is a dark art [23:21:37] hmhm [23:21:43] anyway, seems all better now! 
[23:22:02] DanielK_WMDE: I gave it a stern talking to [23:22:41] (03Draft2) 10MarcoAurelio: Adding "Categoria:" as namespace alias for ext.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337192 [23:22:49] (03Draft1) 10MarcoAurelio: Adding "Categoria:" as namespace alias for ext.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337192 [23:22:51] (03PS3) 10Dzahn: add prometheus1003/1004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/336354 (https://phabricator.wikimedia.org/T152504) [23:24:34] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3018599 (10JustBerry) + `2017-02-10 23:12 RainbowSprinkles: gerrit: restarting service` to https://wikitech.wikimedia.org/wiki/Server_Admin_Log. After the restart, a handful... [23:24:48] "GC Allocation failure" or red herring like the other times [23:27:58] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [23:28:13] Oh, same error like last time? [23:28:17] I though we disabled gc? [23:28:22] though = thought [23:29:41] git gc yes, jvm gc no [23:29:44] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3018602 (10Paladox) p:05Unbreak!>03High As it is fixed now we can lower it to high. [23:29:50] oh [23:30:25] so i saw some of those in in /srv/gerrit/jvmlogs while it was slow [23:30:33] but we also thought this before [23:30:55] oh [23:30:55] 2017-02-10T23:30:33.634+0000: 1069.538: [GC (Allocation Failure) [23:31:10] jvm gc can pause applications [23:31:29] yea, Total time for which application threads were stopped: 0.0006232 seconds, Stopping threads took: 0.0000235 seconds and stuff [23:31:39] Would jvm g1 be an improvement in that it will speed up the time it pauses the application. 
[23:31:46] yep [23:32:09] So, g1 won't hurt and we could definitely do it [23:32:40] I wish I knew what was causing gerrit to flip out in gc though. [23:32:56] Ok. I've tested g1 i think, didn't fail for me though. [23:33:10] yea, you tried in labs [23:33:48] RainbowSprinkles it's when all the memory is used up that we gave to the application. [23:34:01] it will run jvm gc to try and clean it up. [23:34:16] Yeah, but that's weird why it would happen. The machine it's on now has more memory than any prior machine it's been on :) [23:34:30] or that's what it looks like from https://ganglia.wikimedia.org/latest/?r=day&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org [23:34:59] RainbowSprinkles giving it too much heap can be as bad as giving it too little. [23:35:23] The amount of heap is fine, i'm just baffled as to what's exhausting the heap [23:35:33] * paladox too. [23:36:08] (03PS3) 10MarcoAurelio: Adding "Categoria:" as namespace alias for ext.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337192 (https://phabricator.wikimedia.org/T157846) [23:36:59] (03PS1) 10Chad: Gerrit: Stop stuffing so many cache things into memory [puppet] - 10https://gerrit.wikimedia.org/r/337193 [23:37:04] mutante: That might also help a tad ^ [23:37:08] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:37:08] Definitely won't make anything worse [23:37:51] (03CR) 10Paladox: [C: 031] "Looks ok and this is what we had on our old old gerrit server" [puppet] - 10https://gerrit.wikimedia.org/r/337193 (owner: 10Chad) [23:38:18] Plus I wanna move to redis anyway, no point in using memory caches really [23:38:18] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:39:08] ok, let's do that [23:39:15] (03CR) 10Dzahn: [C: 032] Gerrit: Stop stuffing so many cache things into memory [puppet] - 10https://gerrit.wikimedia.org/r/337193 (owner: 10Chad) [23:40:14] RainbowSprinkles: are you handling cobalt side? done on puppetmaster [23:40:20] Yeah, will do [23:40:25] I have a theory that the conflicts cache is pretty high-traffic [23:40:30] And that's causing gerrit to flip out [23:40:36] ok [23:41:32] !log gerrit: Restarting to pick up config changes [23:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:58] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [23:43:49] Gonna watch it for a bit, but I've got a funny feeling it'll help [23:44:22] let's link that change to the ticket [23:45:19] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3018631 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/337193/ https://gerrit.wikimedia.org/r/#/c/337193/1/modules/gerrit/templates/gerrit.config.erb [23:47:53] (03PS2) 10Dzahn: switch apt.wm.org from carbon to install1002 [dns] - 10https://gerrit.wikimedia.org/r/335734 (https://phabricator.wikimedia.org/T132757) [23:49:49] Need a longer view of the CPU/memory usage, but seems happier [23:49:57] * RainbowSprinkles calms down [23:54:01] :) [23:54:18] (03PS5) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) [23:55:34] (03CR) 10jerkins-bot: [V: 04-1] nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [23:56:10] (03CR) 10Dzahn: [C: 032] "tested to install/update/upgrade from 
planet2001 with install1002 directly in sources list. deb http://install1002.wikimedia.org/wikimed" [dns] - 10https://gerrit.wikimedia.org/r/335734 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [23:57:59] (03PS6) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870)