[00:02:03] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [100000000.0]
[00:09:52] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[02:16:17] (Abandoned) Yuvipanda: base: Increase ephemeral port range everywhere [puppet] - https://gerrit.wikimedia.org/r/253508 (owner: Yuvipanda)
[02:25:08] (PS5) Alex Monk: [WIP] deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064)
[02:26:26] (CR) jenkins-bot: [V: -1] [WIP] deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: Alex Monk)
[02:26:56] Krenair: I guess I should wait for you to get off the WIP tag before leaving comments?
[02:28:55] (CR) Alex Monk: "Cleaned up PS4 cron on mira and deployment-tin" [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: Alex Monk)
[02:30:05] (PS6) Alex Monk: [WIP] deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064)
[02:35:05] YuviPanda, you can leave comments now
[02:38:15] (CR) Yuvipanda: [WIP] deployment-prep: keyholder shinken monitoring (2 comments) [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: Alex Monk)
[02:42:17] (CR) Alex Monk: [WIP] deployment-prep: keyholder shinken monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: Alex Monk)
[02:43:06] (PS7) Alex Monk: [WIP] deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064)
[02:43:15] YuviPanda, should I expect it to have put data in graphite by now?
[02:44:35] Krenair: yeah, after a couple of minutes usually
[02:45:01] I'm looking at the tree on http://graphite.wmflabs.org/
[02:45:11] I see deployment-prep.deployment-tin.SSHSessionsCollector
[02:45:19] but nothing for KeyholderStatusCollector
[02:46:09] Krenair: check diamond logs?
[02:46:44] last entry is [2016-02-10 22:54:20,895] [MainThread] Stopped task scheduler.
[02:47:02] in /var/log/diamond/diamond.log
[02:48:31] is it currently running?
[02:48:56] I see /usr/share/diamond/collectors/Keyholder/Keyholder.py
[02:49:06] yep: diamond 28331 1 0 02:27 ? 00:00:02 /usr/bin/python /usr/bin/diamond --foreground --skip-fork --skip-pidfile
[02:49:29] hmm.
[02:51:06] hmm, not sure. I also dunno why there are no logs for such a long time
[02:51:33] /var/log/upstart/diamond.log contains some stuff
[02:51:42] and is getting updated
[02:51:58] it's mostly full of "No NFS mount points were found"
[02:54:11] would it matter that I didn't override get_default_config in my collector to set method threaded?
[02:54:40] not sure. It has been too long since I did anything.
[02:55:01] Krenair: add a log call in the code, restart diamond see if it shows up?
[02:56:13] my script definitely gets run
[03:00:10] Krenair: I wonder if it is returning 0 and that's confusing diamond (shouldn't)
[03:00:20] but I'm only peanut-gallerying
[03:00:56] It doesn't appear to be calling the collect function
[03:02:16] aha.
[03:02:40] (CR) Yuvipanda: [WIP] deployment-prep: keyholder shinken monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: Alex Monk)
[03:02:43] YuviPanda, the class name has to be the name given the puppet class
[03:02:51] with 'Collector' appended
[03:03:11] Now it gets UNKNOWN: You do not have permission to list /etc/keyholder.d.
[03:04:16] (PS8) Alex Monk: [WIP] deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064)
[03:04:22] aaah, I guess you might need a sudo rule?
[03:04:26] yes
[03:05:34] (CR) jenkins-bot: [V: -1] [WIP] deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: Alex Monk)
[03:07:31] (PS9) Alex Monk: [WIP] deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064)
[03:16:55] (PS10) Alex Monk: deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064)
[03:17:58] (CR) jenkins-bot: [V: -1] deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: Alex Monk)
[03:24:39] YuviPanda, okay, so...
[03:24:43] OSError: [Errno 2] No such file or directory
[03:25:00] from things like subprocess.Popen('/usr/bin/sudo echo hi')
[03:25:45] ah.
[03:26:38] Krenair: look at modules/diamond/files/collector/minimalpuppetagent.py
[03:26:43] found the problem
[03:26:51] I needed to use a list of arguments
[03:26:53] it uses sudo as well.
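(Editor's note: the OSError discussed above can be reproduced with a small sketch. It is not the collector code from the patch. It assumes a POSIX system and uses the running Python interpreter as a hypothetical stand-in for '/usr/bin/sudo echo hi', so that it runs without sudo rights. The names COMMAND_LINE, fails_as_single_string and run_as_list are made up for illustration.)

```python
import errno
import shlex
import subprocess
import sys

# Hypothetical stand-in for '/usr/bin/sudo echo hi': the current interpreter
# printing "hi", so the sketch does not depend on sudo being configured.
COMMAND_LINE = "{} -c \"print('hi')\"".format(shlex.quote(sys.executable))


def fails_as_single_string(command_line):
    """Return True if passing the whole command line as one string raises ENOENT."""
    try:
        # Without shell=True, POSIX treats the entire string as the program
        # name, so no file called "<interpreter> -c ..." can be found.
        subprocess.Popen(command_line).wait()
    except OSError as err:
        return err.errno == errno.ENOENT
    return False


def run_as_list(command_line):
    """Split the command line into an argument list and return its stdout."""
    args = shlex.split(command_line)
    return subprocess.check_output(args).decode().strip()


if __name__ == "__main__":
    print(fails_as_single_string(COMMAND_LINE))
    print(run_as_list(COMMAND_LINE))
```

(Passing shell=True would also make the single-string form work, but an explicit argument list avoids shell quoting issues, which is probably the safer choice for a command run through sudo.)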
[03:27:22] (PS11) Alex Monk: deployment-prep: keyholder shinken monitoring [puppet] - https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064)
[03:36:44] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.42 seconds
[03:38:34] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.25 seconds
[05:22:31] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.30 seconds
[05:32:20] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.49 seconds
[06:31:02] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: puppet fail
[06:31:11] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:42] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:01] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:20] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:33:42] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:52] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:52] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:47:42] RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:55:12] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:56:11] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:20] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:56:50] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:56:51] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:25] ACKNOWLEDGEMENT - cassandra service on restbase1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Muehlenhoff node has been decomissioned, see T95253 and SAL
[06:58:01] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:01] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[07:00:30] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 61 failures
[07:02:45] !log changing binlog_format to MIXED on db2018
[07:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:11:51] (PS1) Jcrespo: Set db2018 binlog_format to MIXED; testing it as the reason for lag [puppet] - https://gerrit.wikimedia.org/r/283939
[07:18:26] (CR) Jcrespo: [C: 2] Set db2018 binlog_format to MIXED; testing it as the reason for lag [puppet] - https://gerrit.wikimedia.org/r/283939 (owner: Jcrespo)
[07:28:01] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:31:42] (PS5) Elukey: Override kafkatee's default logrotate/rsyslog configuration. [puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322)
[07:33:24] (PS6) Elukey: Override kafkatee's default logrotate/rsyslog configuration. [puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322)
[07:35:25] (PS1) Muehlenhoff: Use require_package in archiva role [puppet] - https://gerrit.wikimedia.org/r/283941
[07:37:55] (CR) Elukey: "This is a proposal for a compromise to allow code re-use outside wikimedia and encapsulation for our modules. logrotate and rsyslog config" [puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: Elukey)
[07:39:18] (PS7) Elukey: Override kafkatee's default logrotate/rsyslog configuration. [puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322)
[07:53:11] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 705 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5078001 keys - replication_delay is 705
[07:53:22] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:53:55] <_joe_> interesting
[07:55:01] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5043714 keys - replication_delay is 33
[07:57:43] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[08:03:03] (CR) Hashar: [C: 1] deployment-prep shinken: deployment-salt is no longer the puppetmaster [puppet] - https://gerrit.wikimedia.org/r/283783 (owner: Alex Monk)
[08:23:07] Operations, DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2213435 (Volans) @jcrespo I've quickly parametrized the variables and added the time for each wiki check. Checking that it was working I found a minor bug in the previous version and two strange things. The bug was c...
[08:23:25] (PS1) Volans: DBtools: add script to check external storage [software] - https://gerrit.wikimedia.org/r/283946 (https://phabricator.wikimedia.org/T130702)
[08:27:15] Operations, Analytics-EventLogging, Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2213440 (fgiunchedi) btw the alert still shows up as UNKNOWN in icinga since 2d `Throughput of EventLogging NavigationTiming events UNKNOWN 2...
[08:27:56] (PS1) Jcrespo: Update comments to mark as master only the local one to the dc [mediawiki-config] - https://gerrit.wikimedia.org/r/283947 (https://phabricator.wikimedia.org/T124699)
[08:29:54] (CR) Jcrespo: "Note that both dump and vslow is assumed inactive, those servers have to be lowered in weight in case a codfw terbium is activated." [mediawiki-config] - https://gerrit.wikimedia.org/r/283947 (https://phabricator.wikimedia.org/T124699) (owner: Jcrespo)
[08:33:43] Operations, Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2213447 (Joe) >>! In T129963#2207420, @ori wrote: >>>! In T129963#2179841, @Joe wrote: >> I don't think measuring latencies for memcached (where they are usually around 1 ms) is th...
[08:39:06] Operations, Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2213452 (Joe) @ori great job, I think 97% is a good (but not optimal) result; it would be interesting to see if we can push that up with different settings, and how much that would...
[08:42:37] Operations: cronspam from argon - apache2 logrotate - https://phabricator.wikimedia.org/T132896#2213456 (elukey)
[08:49:45] Operations, DBA, Patch-For-Review: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2213475 (jcrespo) I think you are getting out of scope here- I did not ask to check mediawiki's data integrity- that is complex and we should stick to use mediawiki tools for that (there are lik...
[08:54:33] ACKNOWLEDGEMENT - puppet last run on restbase1006 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi decommissioning
[08:57:49] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: Nicko)
[09:04:34] (CR) Jcrespo: [C: 2] Update comments to mark as master only the local one to the dc [mediawiki-config] - https://gerrit.wikimedia.org/r/283947 (https://phabricator.wikimedia.org/T124699) (owner: Jcrespo)
[09:07:12] sync-masters is taking more than usual
[09:07:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Update master comments in preparation for dc failover (duration: 01m 41s)
[09:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:08:13] !log jynus@tin Synchronized wmf-config/db-codfw.php: Update master comments in preparation for dc failover (duration: 00m 26s)
[09:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:08:42] I see some exceptions
[09:11:54] Operations, DBA, Patch-For-Review: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2213565 (Volans) My only problem is that I'm not sure that the blobs I'm checking were created on that period of time, that's the only thing I want to be sure of.
[09:17:12] !log repool restbase2006 after raid expansion
[09:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:17:49] Operations, DBA, Patch-For-Review: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2213575 (jcrespo) Give it a guess, then double the timeframe. E.g. estimate the number of edits per day, multiply by 30, then double it.
[09:19:10] !log depool restbase100[56].eqiad.wmnet, about to get decomissioned
[09:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:20:12] <_joe_> !log upgrading HHVM on terbium
[09:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:21:02] Operations, Analytics-EventLogging, Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2213659 (elukey) ``` elukey@stat1002:~$ kafkacat -L -b kafka1012.eqiad.wmnet:9092 | grep -i navigation topic "eventlogging_NavigationTiming"...
[09:24:24] Operations, HHVM: upgrade HHVM to 3.12.1 on terbium - https://phabricator.wikimedia.org/T132751#2213662 (Joe) Open>Resolved
[09:25:33] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2213677 (Joe) I am waiting until we switch back mediawiki from codfw before I definitively decommission the last batch of appse...
[09:34:03] (Abandoned) Jcrespo: Make codfw db masters as the masters of all datacenters [mediawiki-config] - https://gerrit.wikimedia.org/r/277754 (https://phabricator.wikimedia.org/T124699) (owner: Jcrespo)
[09:34:25] (PS1) Giuseppe Lavagetto: cache: switch to codfw for restbase/citoid/cxserver [puppet] - https://gerrit.wikimedia.org/r/283951
[09:36:08] (CR) Gehel: Improve robustness of es-tool (1 comment) [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[09:39:09] jouncebot, next
[09:39:15] jouncebot: next
[09:39:19] meh
[09:39:24] (CR) Gehel: [C: 1] "Looks good. Thanks @Nicko. Will merge asap..." [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[09:40:28] <_joe_> Luke081515: no releases this week
[09:40:43] ah, that's the reason, ok :D
[09:40:51] <_joe_> we have a one-week code freeze during the datacenter switchover test
[09:41:07] I guess someone is not happy to have some strange errors at his log ;)
[09:41:09] <_joe_> we'd prefer to deal with just the WTFs we will find when switching over
[09:41:26] <_joe_> than with those with bugs superimposed in some way :P
[09:41:35] Operations, Analytics-EventLogging, Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2213691 (elukey) Cancelled the above comment since it was outdated by https://gerrit.wikimedia.org/r/#/c/283673/2/modules/eventlogging/manifes...
[09:41:39] <_joe_> new bugs, I mean
[09:43:42] !log installing openssh updates on jessie systems
[09:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:46:50] (PS1) Giuseppe Lavagetto: switchover: block maintenance scripts from running in eqiad [puppet] (switchover) - https://gerrit.wikimedia.org/r/283952
[09:48:15] (PS1) Jcrespo: Put eqiad in read-only mode for datacenter switchover to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/283953 (https://phabricator.wikimedia.org/T124699)
[09:48:51] (PS1) Giuseppe Lavagetto: switchover: enable maintenance scripts in codfw [puppet] (switchover) - https://gerrit.wikimedia.org/r/283954
[09:49:11] (CR) Jcrespo: [C: -2] "To be deployed at Tuesday, April 19th, 14:00 UTC" [mediawiki-config] - https://gerrit.wikimedia.org/r/283953 (https://phabricator.wikimedia.org/T124699) (owner: Jcrespo)
[09:49:38] (CR) jenkins-bot: [V: -1] switchover: block maintenance scripts from running in eqiad [puppet] (switchover) - https://gerrit.wikimedia.org/r/283952 (owner: Giuseppe Lavagetto)
[09:51:09] <_joe_> wat?
[09:54:45] (CR) Mobrovac: [C: 1] cache: switch to codfw for restbase/citoid/cxserver [puppet] - https://gerrit.wikimedia.org/r/283951 (owner: Giuseppe Lavagetto)
[10:01:20] <_joe_> ok let's go
[10:02:32] (CR) Giuseppe Lavagetto: [C: 2] "Switching as scheduled" [puppet] - https://gerrit.wikimedia.org/r/283951 (owner: Giuseppe Lavagetto)
[10:02:50] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/client-side-04 forwarder/legacy-zmq
[10:03:20] ---^ checking, just restarted a kafka node for an upgrade
[10:04:20] <_joe_> running puppet on the cache nodes in eqiad
[10:04:49] kk
[10:06:39] <_joe_> I will not switch the mediawiki load - it will switch naturally when we switch over mediawiki
[10:07:34] <_joe_> !log traffic from eqiad caches switched to codfw for restbase,citoid,cxserver
[10:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:07:46] ok, so this is relevant only for external traffic
[10:07:48] (PS1) Muehlenhoff: Setup meitnerium as the jessie-based archiva host [puppet] - https://gerrit.wikimedia.org/r/283956
[10:09:26] <_joe_> yes
[10:12:10] Operations, ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2213725 (mark)
[10:12:59] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning.
[10:16:03] --^ EventLogging still has issues when a kafka broker restarts (so the kafka brokers metadata changes), we thought to have fixed the problem but it is still there
[10:16:24] _joe_: state looking good so far
[10:16:53] so it is querying the active datacenter, right, not yet codfw?
[10:16:56] <_joe_> mobrovac: yeah seems the same here
[10:17:00] for mediawiki I mean
[10:17:07] <_joe_> jynus: it's calling the mw api in eqiad, yes
[10:17:11] ok
[10:17:31] <_joe_> that will change when we change $app_routes['mediawiki'] tomorrow
[10:17:36] <_joe_> in puppet
[10:18:13] do we need to create a list of hosts to force puppet on?
[10:20:02] <_joe_> we have it in the instructions
[10:20:37] ok
[10:20:53] I'm filling mine, slowly
[10:28:49] (PS1) Mobrovac: RESTBase: remove rb100[56] from the list of eqiad seeds [puppet] - https://gerrit.wikimedia.org/r/283958
[10:45:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[10:45:50] (PS1) Giuseppe Lavagetto: elasticsearch: enable ganglia in codfw [puppet] - https://gerrit.wikimedia.org/r/283960
[10:46:27] <_joe_> investigating the 5xx surge
[10:55:18] (CR) Giuseppe Lavagetto: "@chasemp: is there any reason why ganglia is not enabled in codfw besides being not correctly defined in the ganglia config?" [puppet] - https://gerrit.wikimedia.org/r/283960 (owner: Giuseppe Lavagetto)
[10:56:11] (CR) Faidon Liambotis: [C: -1] "The flag is fine I guess, but not using rsyslog::conf is not. I'm in favor of reusable well-abstracted modules, but not if this means not " [puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: Elukey)
[10:57:34] (CR) Faidon Liambotis: [C: 2] "Merge at your convenience." [puppet] - https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: Gehel)
[10:57:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:58:15] (CR) Faidon Liambotis: [C: 1] Reuse update-initramfs in lvs::balancer and interface::rps [puppet] - https://gerrit.wikimedia.org/r/283678 (owner: Ema)
[10:59:21] (CR) Faidon Liambotis: [C: 1] "Well, I'm in favor of this, assuming it works :)" [puppet] - https://gerrit.wikimedia.org/r/283623 (owner: Alexandros Kosiaris)
[11:01:13] (CR) Filippo Giunchedi: [C: 2 V: 2] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/283958 (owner: Mobrovac)
[11:07:03] (PS8) Elukey: Override kafkatee's default logrotate/rsyslog configuration. [puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322)
[11:09:51] (CR) Elukey: "Trying to make a compromise among all the comments:" [puppet/kafkatee] - https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: Elukey)
[11:17:24] (PS1) Muehlenhoff: Enable base::firewall for rhenium [puppet] - https://gerrit.wikimedia.org/r/283961
[11:26:49] PROBLEM - puppet last run on mw2027 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:30:07] !log restbase restarting eqiad after https://gerrit.wikimedia.org/r/#/c/283958/
[11:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:31:08] (CR) Filippo Giunchedi: [C: 1] servermon: Add Krenair (Alex Monk) to servermon users [puppet] - https://gerrit.wikimedia.org/r/283623 (owner: Alexandros Kosiaris)
[11:32:04] (CR) Filippo Giunchedi: [C: 1] "there's a slight change of semantics (rebuild initramfs for all kernels vs only the latest) but doesn't seem to be a problem" [puppet] - https://gerrit.wikimedia.org/r/283678 (owner: Ema)
[11:33:10] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:37:09] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: puppet fail
[11:47:03] "Internal Server Error" on https://ticket.wikimedia.org/
[11:48:12] taking a look
[11:49:31] RECOVERY - cassandra service on restbase1006 is OK: OK - cassandra is active
[11:50:21] RECOVERY - puppet last run on restbase1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:51:53] Thibaut120094: looks recovered now
[11:53:29] RECOVERY - puppet last run on mw2027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:54:02] nope still getting the error message
[11:54:38] the error message only appears when you're logged.
[11:55:40] PROBLEM - cassandra service on restbase1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[11:57:32] Thibaut120094: ah, I believe Alex might be on it but don't see him on irc atm
[11:57:47] ok
[11:58:47] I can login fine...
[11:58:51] Thibaut120094: What are you trying to do on otrs?
[11:59:02] Ah, viewing a ticket
[11:59:07] hacking the website
[11:59:11] lol
[11:59:17] godog: Looks to be broken when you drill down to view a ticket
[11:59:23] yeah
[12:00:31] yup I'm trying to reach Alex
[12:00:41] godog: "perl? fuck that!" :P
[12:00:56] Have a look at the apache log for anything obvious
[12:01:39] aye, Message: Kernel::System::HTMLUtils could not be loaded: Attempt to reload Kernel/System/HTMLUtils.pm aborted.
[12:01:42] Compilation failed in require at /opt/otrs/Kernel/System/ObjectManager.pm line 191.
[12:02:09] I think that's related to https://phabricator.wikimedia.org/T132822
[12:02:19] Are their some processes pegging the cpu?
[12:02:40] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[12:02:42] the reason I'm trying to reach akosiaris is that I see him logged on mendelevium and tailing the apache error log too
[12:02:46] yeah one is pegged
[12:05:40] PROBLEM - RAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[12:07:50] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: puppet fail
[12:15:13] Thibaut120094: should be ok now
[12:15:26] akosiaris fixed it, told me to relay, he's having IRC troubles
[12:15:32] cc: godog, Reedy
[12:15:34] thanks!
[12:15:45] oh ok, thanks paravoid !
[12:18:39] RECOVERY - cassandra service on restbase1006 is OK: OK - cassandra is active
[12:20:14] ACKNOWLEDGEMENT - cassandra CQL 10.64.48.100:9042 on restbase1006 is CRITICAL: Connection refused Filippo Giunchedi decom
[12:24:39] PROBLEM - cassandra service on restbase1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[12:29:59] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[12:39:36] Operations, Traffic, WMF-Communications, HTTPS, Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2066342 (Reedy) https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=9091759 ``` The issues we...
[12:48:51] RECOVERY - cassandra service on restbase1006 is OK: OK - cassandra is active
[12:55:20] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: puppet fail
[13:16:50] Operations: Remove unused Samba packages - https://phabricator.wikimedia.org/T132915#2213985 (MoritzMuehlenhoff)
[13:16:54] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 6 failures
[13:18:44] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:15] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:22:45] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:23:03] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:23] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [13:27:13] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:27:56] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:53] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:31:45] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:43] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [13:34:04] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [13:36:14] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2066342 (10Platonides) This seems to have started since the last certificate change (not before 2015-12-10). What chain... [13:43:24] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:48:09] (03CR) 10Rush: [C: 031] "historical reasonings from the time period w/ discovery but no reason to not do it." 
[puppet] - 10https://gerrit.wikimedia.org/r/283960 (owner: 10Giuseppe Lavagetto) [13:53:27] 06Operations: API apache servers OOMing: mw1134 mw1132 mw1139 mw1138 - https://phabricator.wikimedia.org/T132845#2214030 (10Andrew) p:05Unbreak!>03Normal [13:55:39] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2214035 (10BBlack) >>! In T128182#2214024, @Platonides wrote: > This seems to have started since the last certificate c... [13:56:45] 06Operations, 10ops-eqiad: db1047.eqiad.wmnet: slot=3 failed - https://phabricator.wikimedia.org/T132917#2214037 (10fgiunchedi) [13:57:23] PROBLEM - configured eth on labvirt1002 is CRITICAL: Connection refused by host [13:57:28] ACKNOWLEDGEMENT - RAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Filippo Giunchedi https://phabricator.wikimedia.org/T132917 [13:57:33] PROBLEM - Disk space on labvirt1002 is CRITICAL: Connection refused by host [13:57:33] PROBLEM - nova-compute process on labvirt1002 is CRITICAL: Connection refused by host [13:57:44] PROBLEM - RAID on labvirt1002 is CRITICAL: Connection refused by host [13:57:44] PROBLEM - DPKG on labvirt1002 is CRITICAL: Connection refused by host [13:57:48] (03PS1) 10Jcrespo: Depool parsercaches #1 for cloning pc1004 -> pc2004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283972 [13:57:53] PROBLEM - kvm ssl cert on labvirt1002 is CRITICAL: Connection refused by host [13:58:05] PROBLEM - dhclient process on labvirt1002 is CRITICAL: Connection refused by host [13:58:22] (03CR) 10Jcrespo: [C: 032] Depool parsercaches #1 for cloning pc1004 -> pc2004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283972 (owner: 10Jcrespo) [13:58:24] PROBLEM - salt-minion processes on labvirt1002 is CRITICAL: Connection refused by host [13:58:35] PROBLEM - puppet last run on labvirt1002 is CRITICAL: Connection refused by host [13:59:24] hm… what 
does it mean that labvirt1002 is refusing contact with icinga but I can ssh in just fine? [14:00:43] RECOVERY - salt-minion processes on labvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:00:54] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:01:43] RECOVERY - configured eth on labvirt1002 is OK: OK - interfaces up [14:01:46] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool pc2004 (duration: 00m 27s) [14:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:53] RECOVERY - Disk space on labvirt1002 is OK: DISK OK [14:01:53] RECOVERY - nova-compute process on labvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:02:04] RECOVERY - RAID on labvirt1002 is OK: OK: no RAID installed [14:02:05] RECOVERY - DPKG on labvirt1002 is OK: All packages OK [14:02:13] RECOVERY - kvm ssl cert on labvirt1002 is OK: Cert /etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt will not expire for at least 90 days [14:02:25] RECOVERY - dhclient process on labvirt1002 is OK: PROCS OK: 0 processes with command name dhclient [14:02:25] Looks like nrpe crashed. 
That's a new one to me [14:02:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool pc1004 (duration: 00m 29s) [14:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:06] !log puppet agent -tv on labvirt1002 which restarted nrpe and resolved icinga alerts [14:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:25] (03PS1) 10Giuseppe Lavagetto: heartbeat: run the script if enabled, kill it if not [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/283974 [14:07:07] <_joe_> jynus: ^^ step 1, step two is to change the actual puppet code we're using when upgrading the ref to the submodule [14:07:30] 06Operations, 06Commons, 06Multimedia, 10Traffic: Commons API fails (413 error) to upload file within 100MB threshold - https://phabricator.wikimedia.org/T86436#2214064 (10Slowking4) this is a problem for uploading books with IA moving away from dejavu the bloated jpegs will be larger important copyright... [14:07:37] 06Operations, 10Analytics-EventLogging, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2214065 (10fgiunchedi) fwiw looks like it is known now but warning for the last 4h ```Throughput of EventLogging NavigationTiming events WARNIN... [14:08:17] just drop the /etc/init.d/pt-heartbeat file entirely [14:08:55] <_joe_> oh that too, yes [14:09:09] but do not run it if you have a conditional command [14:09:24] that will create a continuous kill or not? [14:10:16] "if the pid doesn't exist, kill it" [14:10:20] akosiaris: ready for clinic hand-off?
(I have nothing to report, really — calendar is up to date and there are no pending access requests) [14:10:33] andrewbogott: yeah ok [14:11:56] 06Operations, 10Analytics-EventLogging, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2214066 (10elukey) @fgiunchedi: the above warning is it likely related to the issue that I was experiencing this morning with Event Logging afte... [14:12:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor nitpic, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [14:13:45] PROBLEM - HHVM rendering on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:22] (03PS2) 10Giuseppe Lavagetto: heartbeat: run the script if enabled, kill it if not [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/283974 [14:14:23] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:43] PROBLEM - configured eth on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:14:43] PROBLEM - DPKG on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:14:54] PROBLEM - RAID on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:14:54] PROBLEM - Disk space on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:03] PROBLEM - salt-minion processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:14] PROBLEM - HHVM processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:16] (03PS9) 10Ladsgroup: ores: Add support for running precached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [14:15:34] PROBLEM - nutcracker port on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:15:53] PROBLEM - SSH on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:55] PROBLEM - nutcracker process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:16:15] PROBLEM - puppet last run on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:16:25] PROBLEM - Check size of conntrack table on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:17:10] jynus: I'd like to make some progress with labs dns stability this week. Would you have time to work on https://phabricator.wikimedia.org/T128737 ? [14:17:35] <_joe_> andrewbogott: we have the codfw switchover this week [14:17:37] andrewbogott: this week is the codfw switchover week [14:17:40] heh [14:17:45] <_joe_> I doubt any core ops will have time [14:17:48] good point, I guess that means no one has time for anything :) [14:17:51] :) [14:17:57] andrewbogott, this week not at all [14:18:18] but I probably will next week [14:18:41] jynus: ok! thank you [14:19:53] (03CR) 10Faidon Liambotis: [C: 04-1] "Sounds good to me, but please also use logrotate::conf :)" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: 10Elukey) [14:20:57] that doesn't really need a full mysql setup [14:21:23] i mean, pdns as that is setup just caches some data there [14:21:36] !log upgraded pykafka on hafnium and restarted statsv [14:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:47] andrewbogott: I think you're overthinking this :) [14:22:13] PROBLEM - dhclient process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:22:28] it isn't just cached. Records are synced but the domain data is persistent in pdns [14:23:37] and designate writes domains directly to mysql. So it needs to be mysql to keep designate happy [14:25:13] elukey: shall we talk logrotate rsyslog and modules :D ? 
[14:25:26] i see you are getting conflicting advice :) [14:25:33] RECOVERY - nutcracker port on mw1146 is OK: TCP OK - 0.000 second response time on port 11212 [14:25:36] (03PS3) 10Giuseppe Lavagetto: heartbeat: run the script if enabled, kill it if not [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/283974 [14:25:44] RECOVERY - SSH on mw1146 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [14:25:54] RECOVERY - nutcracker process on mw1146 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:26:04] RECOVERY - dhclient process on mw1146 is OK: PROCS OK: 0 processes with command name dhclient [14:26:12] ottomata: they call me the -1 collector [14:26:14] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 44 minutes ago with 0 failures [14:26:24] RECOVERY - Check size of conntrack table on mw1146 is OK: OK: nf_conntrack is 0 % full [14:26:36] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Add support for running precached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [14:26:44] RECOVERY - configured eth on mw1146 is OK: OK - interfaces up [14:26:44] RECOVERY - DPKG on mw1146 is OK: All packages OK [14:26:45] RECOVERY - RAID on mw1146 is OK: OK: no RAID installed [14:26:46] (03PS10) 10Alexandros Kosiaris: ores: Add support for running precached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [14:26:54] RECOVERY - Disk space on mw1146 is OK: DISK OK [14:26:55] RECOVERY - salt-minion processes on mw1146 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:27:05] RECOVERY - HHVM processes on mw1146 is OK: PROCS OK: 6 processes with command name hhvm [14:27:20] 06Operations, 10Analytics-EventLogging, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2214074 (10Ottomata) Ah! 
So there are some services that use this stuff on hafnium managed by the performance team. I noticed that the pykafka... [14:27:22] ottomata: anyhow, yes it would be great to figure out how to proceed. Placing stuff in the role might be an option but then I'd be concerned about where to put the configuration files. [14:27:23] (03CR) 10Alexandros Kosiaris: [V: 032] ores: Add support for running precached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [14:27:42] elukey: is there even a kafkatee role? [14:28:30] ah logging::kafkatee [14:28:43] ja that should be moved into the role modules if we are going to do it [14:28:45] elukey: so ok [14:29:04] i'm not 100% opposed to using rsyslog::conf in the module....but if we do that...there is no reason for it to be a submodule [14:29:11] for kafkatee, i care very little [14:29:26] as there is very little likelihood for that module to be used outside of ops/puppet [14:29:41] varnishkafka also not too bad, although I think it is useful as a submodule in mediawiki-vagrant [14:29:45] so i care a little more there [14:30:14] elukey: another idea [14:30:27] is to move the rsyslog::conf stuff into another class or define in the module [14:30:31] and include it optionally [14:30:43] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 17.24% of data above the critical threshold [100000000.0] [14:30:44] that way the main classes are still useful outside of ops puppet [14:30:55] but if you are in ops/puppet, you can choose to use ops/puppet dependent stuff [14:31:39] ottomata: atm in code review the rsyslog::conf (and soon the logrotate::conf) stuff is guarded by an option [14:31:43] oh!
[14:31:48] i haven't read the new change yet, looking [14:31:50] i just saw some comments [14:32:16] ottomata: https://gerrit.wikimedia.org/r/#/c/283411/8/manifests/init.pp :) [14:32:33] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 78 failures [14:33:17] elukey: i'm fine with that solution! [14:33:36] make it clear in the comment there that if someone sets that to true, they had better have the rsyslog and logrotate modules! [14:34:04] and that it is not likely to work outside of operations/puppet [14:34:54] (03CR) 10Ottomata: [C: 031] Use require_package in archiva role [puppet] - 10https://gerrit.wikimedia.org/r/283941 (owner: 10Muehlenhoff) [14:35:05] ottomata: alll right! [14:36:07] (03CR) 10Ottomata: "+1 for this idea, with good comments about what the dependencies are if you set this flag." [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: 10Elukey) [14:36:23] (03PS4) 10Jcrespo: heartbeat: run the script if enabled, kill it if not [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/283974 (owner: 10Giuseppe Lavagetto) [14:37:52] (03CR) 10Jcrespo: [C: 032] heartbeat: run the script if enabled, kill it if not [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/283974 (owner: 10Giuseppe Lavagetto) [14:44:06] (03PS9) 10Elukey: Override kafkatee's default logrotate/rsyslog configuration. [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) [14:44:07] 06Operations, 10hardware-requests: rack and set up graphite1003 - https://phabricator.wikimedia.org/T132717#2214093 (10fgiunchedi) a:05fgiunchedi>03Cmjohnson machine is up, I found no obvious way to do what I wanted (mixed raid/lvm) in trusty debian-installer/partman, anyways @Cmjohnson anything left on yo... [14:49:06] (03CR) 10Ottomata: [C: 031] "I might have set the default for $configure_rsyslog to false, but for this kafkatee module, I don't really care. 
:)" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: 10Elukey) [14:49:30] (03PS1) 10Jcrespo: [WIP] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 [14:51:47] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 (owner: 10Jcrespo) [14:52:16] yep, that was expected [14:52:50] <_joe_> jynus: so I take you're working on it then :) [14:58:01] (03PS1) 10Filippo Giunchedi: install_server: graphite1003 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/283980 [14:59:47] (03PS2) 10Filippo Giunchedi: install_server: graphite1003 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/283980 [15:00:01] ottomata: any news regarding the "Throughput of EventLogging NavigationTiming events" alert? [15:00:03] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: graphite1003 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/283980 (owner: 10Filippo Giunchedi) [15:00:09] it says "WARNING: 100.00% of data under the warning threshold [1.0] " right now [15:00:14] whatever that means.. [15:00:33] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 72920 bytes in 7.579 second response time [15:01:04] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.945 second response time [15:04:11] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2214142 (10elukey) Checked the netboot.cfg blame list and the raid was configured on purpose to leverage all the 4 disks: ``` elukey@stat100... 
[15:04:24] Hey, I need to be a member of this project: https://phabricator.wikimedia.org/project/view/13/ It says I need a person from Ops to do that [15:04:27] <_joe_> paravoid: it means that no events are flowing, I guess [15:04:47] 06Operations, 10Mail, 10OTRS, 10WMDE-Fundraising-Software: add WMDE mx's to SpamAssassin trusted hosts to fix SPF softfails - https://phabricator.wikimedia.org/T83499#2214146 (10Dzahn) [15:04:51] I would appreciate if you do it for me [15:05:02] that's probably inaccurate [15:05:22] Amir1: you should ask andre [15:05:29] and we should change that description to not mention ops [15:05:38] oh, thanks :) [15:05:51] (the latter is now done) [15:06:41] (03PS2) 10Jcrespo: [WIP] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 [15:07:53] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 (owner: 10Jcrespo) [15:08:30] Krenair: hey, your name is in phabricator project, Can you add me to the https://phabricator.wikimedia.org/project/view/13/ ? [15:09:04] (03PS3) 10Jcrespo: [WIP] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 [15:09:44] PROBLEM - Disk space on silver is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%) [15:09:54] Amir1, done [15:10:04] <_joe_> looking at silver now... 
[15:10:05] awesome [15:10:11] thanks Krenair :) [15:10:35] (03CR) 10Ottomata: [C: 031] Setup meitnerium as the jessie-based archiva host [puppet] - 10https://gerrit.wikimedia.org/r/283956 (owner: 10Muehlenhoff) [15:10:48] <_joe_> it already recovered [15:10:56] <_joe_> I guess it was some huge tmp file [15:11:04] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:11:18] (03PS2) 10Faidon Liambotis: Enable base::firewall for rhenium [puppet] - 10https://gerrit.wikimedia.org/r/283961 (owner: 10Muehlenhoff) [15:11:45] RECOVERY - Disk space on silver is OK: DISK OK [15:11:47] (03CR) 10Faidon Liambotis: [C: 032] Enable base::firewall for rhenium [puppet] - 10https://gerrit.wikimedia.org/r/283961 (owner: 10Muehlenhoff) [15:12:24] (03PS2) 10Giuseppe Lavagetto: elasticsearch: enable ganglia in codfw [puppet] - 10https://gerrit.wikimedia.org/r/283960 [15:12:48] (03PS4) 10Jcrespo: [WIP] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 [15:13:00] (03CR) 10Faidon Liambotis: [V: 032] Enable base::firewall for rhenium [puppet] - 10https://gerrit.wikimedia.org/r/283961 (owner: 10Muehlenhoff) [15:13:04] useless jenkins [15:13:56] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 (owner: 10Jcrespo) [15:14:04] 06Operations, 06Commons, 10MediaWiki-Page-deletion: Unable to delete pages on commons - https://phabricator.wikimedia.org/T132921#2214186 (10Steinsplitter) [15:18:38] (03PS3) 10Giuseppe Lavagetto: elasticsearch: enable ganglia in codfw [puppet] - 10https://gerrit.wikimedia.org/r/283960 [15:19:05] 06Operations, 06Commons, 10MediaWiki-Page-deletion: Unable to delete pages on commons - https://phabricator.wikimedia.org/T132921#2214201 (10Steinsplitter) [15:20:59] 06Operations, 10hardware-requests: rack and set up graphite1003 - 
https://phabricator.wikimedia.org/T132717#2214205 (10Cmjohnson) @fgiunchedi nothing left on my end. [15:21:40] (03PS5) 10Jcrespo: Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 [15:21:57] 06Operations, 06Commons, 10MediaWiki-Page-deletion: Unable to delete pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214206 (10matmarex) [15:23:34] (03CR) 10Giuseppe Lavagetto: [C: 032] "We need to have ganglia data on those servers." [puppet] - 10https://gerrit.wikimedia.org/r/283960 (owner: 10Giuseppe Lavagetto) [15:26:49] (03PS1) 10Hashar: beta: drop mobile cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283986 (https://phabricator.wikimedia.org/T130473) [15:27:33] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0] [15:30:39] (03CR) 10Alex Monk: [C: 031] beta: drop mobile cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283986 (https://phabricator.wikimedia.org/T130473) (owner: 10Hashar) [15:31:38] (03PS1) 10Hashar: dnsmasq: drop beta cluster mobile public IP [puppet] - 10https://gerrit.wikimedia.org/r/283987 (https://phabricator.wikimedia.org/T130473) [15:32:16] 06Operations, 06Commons, 06Multimedia, 10Traffic: Commons API fails (413 error) to upload file within 100MB threshold - https://phabricator.wikimedia.org/T86436#2214250 (10Slowking4) ok i have uploaded this 69.29 MB volume using metadata from IA uploader and uploading using chunked uploads. could someone p... 
[15:41:24] (03CR) 10Alex Monk: [C: 031] dnsmasq: drop beta cluster mobile public IP [puppet] - 10https://gerrit.wikimedia.org/r/283987 (https://phabricator.wikimedia.org/T130473) (owner: 10Hashar) [15:41:44] (03PS1) 10BBlack: basic acme-setup script + acme::init [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [15:41:49] 06Operations, 10ops-eqiad: db1047.eqiad.wmnet: slot=3 failed - https://phabricator.wikimedia.org/T132917#2214037 (10Cmjohnson) Dis replaced and is rebuilding. The task said db1047 but the problem disk was in 1046. [15:41:57] 06Operations, 10ops-eqiad: db1046.eqiad.wmnet: slot=3 failed - https://phabricator.wikimedia.org/T132917#2214260 (10Cmjohnson) [15:44:08] (03CR) 10Giuseppe Lavagetto: [C: 031] "Minor nitpick but LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283979 (owner: 10Jcrespo) [15:44:14] (03PS2) 10BBlack: basic acme-setup script + acme::init [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [15:48:18] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, 07Elasticsearch: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2214275 (10Gehel) [15:48:26] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [15:49:56] (03PS3) 10BBlack: basic acme-setup script + acme::init [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [15:52:01] (03PS6) 10Jcrespo: Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 [15:52:14] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] switch configuration - https://phabricator.wikimedia.org/T132923#2214287 (10Papaul) [15:54:19] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, 07Elasticsearch: Use unicast instead of multicast for node 
communication - https://phabricator.wikimedia.org/T110236#2214306 (10Gehel) It is unclear from the documentation if the unicast / multicast settings are hot reloada... [15:56:24] (03PS4) 10BBlack: basic acme-setup script + acme::init [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [15:57:01] (03PS1) 10Filippo Giunchedi: graphite: add graphite1003 [puppet] - 10https://gerrit.wikimedia.org/r/283989 [16:03:36] 06Operations: Remove cache files from Planet to regenerate fr.planet.wikimedia.org page - https://phabricator.wikimedia.org/T132924#2214318 (10Dereckson) [16:03:47] 06Operations, 10Wikimedia-Planet: Remove cache files from Planet to regenerate fr.planet.wikimedia.org page - https://phabricator.wikimedia.org/T132924#2214330 (10Dereckson) [16:04:29] (03PS2) 10Muehlenhoff: Use require_package in archiva role [puppet] - 10https://gerrit.wikimedia.org/r/283941 [16:06:02] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use require_package in archiva role [puppet] - 10https://gerrit.wikimedia.org/r/283941 (owner: 10Muehlenhoff) [16:12:12] PROBLEM - puppet last run on titanium is CRITICAL: CRITICAL: puppet fail [16:12:40] (03PS1) 10Muehlenhoff: Followup fix for require_package fix for archiva [puppet] - 10https://gerrit.wikimedia.org/r/283991 [16:20:56] (03CR) 10Muehlenhoff: [C: 032 V: 032] Followup fix for require_package fix for archiva [puppet] - 10https://gerrit.wikimedia.org/r/283991 (owner: 10Muehlenhoff) [16:24:31] RECOVERY - puppet last run on titanium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:38:17] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2214476 (10Nuria) [16:39:37] (03PS5) 10BBlack: basic acme-setup script + acme::init [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [16:41:54] (03PS6) 10BBlack: basic acme-setup script + 
acme::init [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [17:00:42] (03CR) 10Faidon Liambotis: [C: 031] "LGTM :)" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: 10Elukey) [17:01:49] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: Add Krenair (Alex Monk) to servermon users [puppet] - 10https://gerrit.wikimedia.org/r/283623 (owner: 10Alexandros Kosiaris) [17:01:54] (03PS2) 10Alexandros Kosiaris: servermon: Add Krenair (Alex Monk) to servermon users [puppet] - 10https://gerrit.wikimedia.org/r/283623 [17:01:59] (03CR) 10Alexandros Kosiaris: [V: 032] servermon: Add Krenair (Alex Monk) to servermon users [puppet] - 10https://gerrit.wikimedia.org/r/283623 (owner: 10Alexandros Kosiaris) [17:06:06] (03CR) 10Elukey: [C: 032] Override kafkatee's default logrotate/rsyslog configuration. [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: 10Elukey) [17:17:30] (03PS1) 10Elukey: Update the kafkatee submodule with the latest rsyslog/logrotate updates. [puppet] - 10https://gerrit.wikimedia.org/r/283998 (https://phabricator.wikimedia.org/T132324) [17:19:10] (03CR) 10Elukey: [C: 032 V: 032] Update the kafkatee submodule with the latest rsyslog/logrotate updates. 
[puppet] - 10https://gerrit.wikimedia.org/r/283998 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [17:23:43] I am reviewing a change on palladium so if you need to merge let me know, I am blocking the line :) [17:28:11] RECOVERY - RAID on db1046 is OK: OK: optimal, 1 logical, 2 physical [17:31:37] !log restarted kafkatee on oxygen after module upgrade (was stuck from this morning due to a kafka broker restart) [17:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:28] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2193908 (10RobH) So chatting with Papaul, this host shows ssl2001 as the hostname when booted. This means this host has never been installed as labtestneutron2001.... [17:37:49] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2193908 (10Andrew) Confirmed, this host is designated for labtest use but so far we have never used it. 
I can reimage shortly if that makes your lives less confusing :) [17:40:59] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2214739 (10Papaul) [17:44:31] (03PS1) 10BBlack: update-ocsp: fix command exceptions [puppet] - 10https://gerrit.wikimedia.org/r/284001 [17:45:12] (03PS4) 10Dzahn: create sslcert::letsencrypt::simple, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) [17:45:14] (03PS1) 10BBlack: update-ocsp: fix minor pep8 issues [puppet] - 10https://gerrit.wikimedia.org/r/284002 [17:45:53] 06Operations, 10Analytics, 06WMF-Legal, 07Privacy: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#2214758 (10ZhouZ) [17:46:01] (03CR) 10Dzahn: "@Legotkm done, i copy/pasted it from a separate file into the script" [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn) [17:49:25] (03CR) 10BBlack: "::simple should probably be ::init, as this installs common files to work on these certs, rather than defining individual certs." [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn) [17:54:42] 06Operations, 10ops-eqiad: audit/remove two cross-connection patch cables - https://phabricator.wikimedia.org/T132945#2214842 (10RobH) [17:56:42] (03PS1) 10Papaul: DNS:Adding prodcution DNS entries for conf200[1-3] Bug:T131959 [dns] - 10https://gerrit.wikimedia.org/r/284004 (https://phabricator.wikimedia.org/T131959) [17:58:23] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:51] hmm, joal ^ ? 
[18:00:51] ottomata: looking, but bizarre, nothing ongoing [18:01:04] (03CR) 10Dzahn: [C: 031] added .gitreview file [puppet/kafka] - 10https://gerrit.wikimedia.org/r/283855 (owner: 10Mschon) [18:01:56] weird ottomata, cluster is heavily loaded [18:02:30] ottomata: ^ you dont mind if we add .gitreview file to kafka, do you [18:02:32] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:02:59] (03PS2) 10Papaul: DNS:Adding prodcution DNS entries for conf200[1-3] Bug:T131959 [dns] - 10https://gerrit.wikimedia.org/r/284004 (https://phabricator.wikimedia.org/T131959) [18:03:04] ottomata: seems compaction related :( [18:03:51] joal: should we do anything? [18:03:54] mutante: , not at all! [18:04:00] ottomata: 'k, cool [18:04:12] (03CR) 10Dzahn: [C: 032] added .gitreview file [puppet/kafka] - 10https://gerrit.wikimedia.org/r/283855 (owner: 10Mschon) [18:04:22] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [18:04:32] ottomata: I don't know if there's anything we can do [18:04:32] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:04:54] joal: ok [18:05:03] ottomata: one big compaction just finished from what I see [18:05:11] (03CR) 10Dzahn: [V: 032] added .gitreview file [puppet/kafka] - 10https://gerrit.wikimedia.org/r/283855 (owner: 10Mschon) [18:05:25] joal: ottomata: those 500s seem to be mostly cass timeouts [18:05:44] rb's recorded around 10k of them in the last hour [18:05:49] mobrovac: yeah, 500 we know [18:06:08] mobrovac: there are unhealthy checks appearing currently [18:06:23] mobrovac: cassandra was overloading the machines [18:06:47] joal: these nrpe fails are a by-product of that [18:07:22] by waiting on timeouts, all connections are exhausted per worker, so the check is unable to connect [18:07:36] makes sense mobrovac [18:07:51] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, and 2 others: Migrate tools.wmflabs.org to https only (and set HSTS) - 
https://phabricator.wikimedia.org/T102367#2214934 (10Andrew) a:03Andrew [18:07:52] (03PS1) 10Dzahn: kafka: bump submodule to add .gitreview [puppet] - 10https://gerrit.wikimedia.org/r/284007 [18:08:24] mobrovac: Thanks for explanations [18:08:31] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, and 2 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2214936 (10yuvipanda) >>! In T102367#1372878, @Nemo_bis wrote: >> Should be fairly simple to do. > > I doubt this is possible. Several tools do not... [18:11:58] (03PS1) 10Yuvipanda: Add a version parameter to tools manifest [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/284008 [18:14:59] (03PS1) 10Papaul: DHCP: Adding MAC address entries for conf200[1-3] Bug: T131959 [puppet] - 10https://gerrit.wikimedia.org/r/284010 (https://phabricator.wikimedia.org/T131959) [18:15:41] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2073537 (10yuvipanda) This seems not particularly useful as a data point, since it is just counting: 1. Tools that are doing HTTPS redirects themselves... [18:15:57] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2214957 (10yuvipanda) [18:17:00] (03CR) 10Dzahn: [C: 032] DNS:Adding prodcution DNS entries for conf200[1-3] Bug:T131959 [dns] - 10https://gerrit.wikimedia.org/r/284004 (https://phabricator.wikimedia.org/T131959) (owner: 10Papaul) [18:18:13] (03PS2) 10Papaul: DHCP: Adding MAC address entries for conf200[1-3] Bug: T131959 [puppet] - 10https://gerrit.wikimedia.org/r/284010 (https://phabricator.wikimedia.org/T131959) [18:20:30] (03CR) 10Legoktm: "Thanks :) I also submitted https://github.com/diafygi/acme-tiny/pull/120 upstream." 
[puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn) [18:20:32] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2214982 (10Papaul) [18:21:03] (03CR) 10Dzahn: [C: 032] "@Mschon this is the "update submodule" change that needs to happen in the main repo after making a change in the submodule. you can see in" [puppet] - 10https://gerrit.wikimedia.org/r/284007 (owner: 10Dzahn) [18:21:07] (03PS1) 10Jcrespo: Revert "Depool parsercaches #1 for cloning pc1004 -> pc2004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284013 [18:23:34] (03PS3) 10Dzahn: DHCP: Adding MAC address entries for conf200[1-3] Bug: T131959 [puppet] - 10https://gerrit.wikimedia.org/r/284010 (https://phabricator.wikimedia.org/T131959) (owner: 10Papaul) [18:23:48] (03CR) 10Dzahn: [C: 032] DHCP: Adding MAC address entries for conf200[1-3] Bug: T131959 [puppet] - 10https://gerrit.wikimedia.org/r/284010 (https://phabricator.wikimedia.org/T131959) (owner: 10Papaul) [18:24:45] (03CR) 10Jcrespo: [C: 032] Revert "Depool parsercaches #1 for cloning pc1004 -> pc2004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284013 (owner: 10Jcrespo) [18:24:48] (03CR) 10Dzahn: [V: 032] DHCP: Adding MAC address entries for conf200[1-3] Bug: T131959 [puppet] - 10https://gerrit.wikimedia.org/r/284010 (https://phabricator.wikimedia.org/T131959) (owner: 10Papaul) [18:27:25] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool pc1004 (duration: 00m 36s) [18:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:37] 06Operations, 10Wikimedia-Planet: Remove cache files from Planet to regenerate fr.planet.wikimedia.org page - https://phabricator.wikimedia.org/T132924#2215048 (10Dzahn) a:03Dzahn [18:30:52] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [18:32:47] (03PS1) 10Papaul: adding install params for conf200[1-3] 
Bug:T131959 [puppet] - 10https://gerrit.wikimedia.org/r/284014 (https://phabricator.wikimedia.org/T131959) [18:33:47] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2215051 (10Papaul) [18:35:59] 06Operations, 06Discovery, 10Salt, 03Discovery-Wikidata-Query-Service-Sprint: Failed to deploy WDQS - https://phabricator.wikimedia.org/T132952#2215066 (10Gehel) [18:36:52] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [18:37:04] 06Operations, 10Wikimedia-Planet: Remove cache files from Planet to regenerate fr.planet.wikimedia.org page - https://phabricator.wikimedia.org/T132924#2215067 (10Dzahn) I deleted the cache files (rm in /var/cache/planet/fr/ but not the ./sources/ subdir) for the "fr" planet and and then ran a manual feed upd... [18:37:42] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool pc2004 (duration: 00m 25s) [18:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:21] (03PS1) 10Jcrespo: Depool pc1005 and pc2005 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284017 [18:40:14] (03PS2) 10Dzahn: adding install params for conf200[1-3] Bug:T131959 [puppet] - 10https://gerrit.wikimedia.org/r/284014 (https://phabricator.wikimedia.org/T131959) (owner: 10Papaul) [18:40:23] (03CR) 10Dzahn: [C: 032] adding install params for conf200[1-3] Bug:T131959 [puppet] - 10https://gerrit.wikimedia.org/r/284014 (https://phabricator.wikimedia.org/T131959) (owner: 10Papaul) [18:40:45] (03CR) 10Jcrespo: [C: 032] Depool pc1005 and pc2005 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284017 (owner: 10Jcrespo) [18:41:01] (03CR) 10Dzahn: [V: 032] adding install params for conf200[1-3] Bug:T131959 [puppet] - 10https://gerrit.wikimedia.org/r/284014 (https://phabricator.wikimedia.org/T131959) (owner: 10Papaul) [18:42:28] !log jynus@tin Synchronized wmf-config/db-codfw.php: 
Depool pc2005 (duration: 00m 26s) [18:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:48] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] switch configuration - https://phabricator.wikimedia.org/T132923#2215074 (10Dzahn) DHCP and DNS have been added, but needs switch config now [18:43:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool pc1005 (duration: 00m 28s) [18:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:53] 06Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2215079 (10Ottomata) [18:45:10] (03PS1) 10Catrope: Enable Flow opt-in beta feature on frwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284020 (https://phabricator.wikimedia.org/T132914) [18:49:32] 06Operations: Remove unused Samba packages - https://phabricator.wikimedia.org/T132915#2213985 (10Dzahn) yes please, i have noticed these in the past when doing upgrades and also wanted to remove them. that nagios-plugins-standard pulls them in explains a lot, yea. please kill [18:49:51] 06Operations: Remove unused Samba packages - https://phabricator.wikimedia.org/T132915#2215110 (10Dzahn) p:05Triage>03Normal [18:50:54] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: Set SPF (... -all) for toolserver.org - https://phabricator.wikimedia.org/T131930#2183433 (10Dzahn) a:03Mschon [18:52:04] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2212399 (10Dzahn) so this is _just_ for cp1044 , right? [18:53:55] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2215140 (10Dzahn) killed gmond on cp1044 (which was running varnishstat).. 
started with /etc/init.d/ganglia-monitor start [18:57:23] 06Operations, 10Wikimedia-Planet: Remove cache files from Planet to regenerate fr.planet.wikimedia.org page - https://phabricator.wikimedia.org/T132924#2215154 (10Dzahn) p:05Triage>03Low [18:57:54] 06Operations: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#2215158 (10Dzahn) a:05Dzahn>03None [18:59:08] 06Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2215160 (10Dzahn) @Muehlenhoff let's talk about when we consider this one to be resolved. Technically the laptop and YubiHSM is there but i think you could not connect to it yet, right? [19:00:01] ottomata: ping [19:03:49] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2215201 (10Dzahn) does not look like it fixed it. cp1044 is all green in Icinga [19:06:16] 06Operations, 10Analytics-EventLogging, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2215217 (10Krinkle) @ottomata It looks to be back up and working fine, but the restart appears to have a change in traffic. andrewbogott: is it possible to tail apache weblogs in the beta cluster? if so, where? [19:07:35] dr0ptp4kt: I don't know much about the internals of the beta cluster. Best to ask in #wikimedia-releng [19:07:43] andrewbogott: thx [19:08:15] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2215237 (10RobH) [19:08:17] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] switch configuration - https://phabricator.wikimedia.org/T132923#2215235 (10RobH) 05Open>03Resolved done! 
[19:11:00] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2212399 (10elukey) @Dzahn: cp1044 is one of the new hosts with Varnish 4, there was a problem with gmond that ema fixed a couple of weeks ago, not sure about this one though! [19:16:18] (03CR) 10Yuvipanda: [C: 032] Add a version parameter to tools manifest [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/284008 (owner: 10Yuvipanda) [19:16:53] (03Merged) 10jenkins-bot: Add a version parameter to tools manifest [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/284008 (owner: 10Yuvipanda) [19:32:22] (03PS7) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 [19:32:24] (03PS1) 10Andrew Bogott: Horizon: Add some help_text to the login fields [puppet] - 10https://gerrit.wikimedia.org/r/284025 (https://phabricator.wikimedia.org/T132694) [19:34:15] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2215307 (10Dzahn) Aha, that's a good hint. @ema is it possible this is different from others because it was used to test the fix or something? [19:49:27] (03PS1) 10Yuvipanda: Add support for v2 manifests, which are written by webservice-new [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284026 [19:49:29] (03PS1) 10Yuvipanda: Bump debian version [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284027 [19:55:03] Krinkle: yt? [19:55:10] ori: yep [19:55:13] ottomata: yep [19:56:09] looking at statsv code [19:56:12] it does [19:56:18] consumer = topic.get_simple_consumer() [19:56:33] no args, so default consumer group is NOne [19:57:34] and [19:57:35] :param auto_commit_enable: If true, periodically commit to kafka the [19:57:35] offset of messages already fetched by this consumer. This also [19:57:35] requires that `consumer_group` is not `None`. 
[19:57:55] default is auto_commit_enable=False, anyway [19:58:08] and also [19:58:09] default [19:58:10] auto_offset_reset=OffsetType.EARLIEST, [19:58:12] so [19:58:25] your statsv consumer is not committing any offsets, so it can't pick up where it left off [19:58:26] and [19:58:46] since the default offset_reset (when no known offset can be found) is earliest, it will consume from the beginning of the stream [19:59:57] 07Puppet, 10Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2215384 (10hashar) The self puppetmaster also tend to auto sign the client certificates. So in theory any labs instance could point to the beta puppet master and... [20:00:46] 06Operations, 10Monitoring, 07Icinga, 07Need-volunteer: check_puppetrun: print "agent disabled" reason - https://phabricator.wikimedia.org/T98481#2215385 (10Dzahn) [20:01:10] 07Puppet, 10Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2215387 (10yuvipanda) So if you use resource collection and care about security (you could block access to puppetmaster from outside the project with security gr... [20:02:14] ottomata: define beginning [20:02:27] ottomata: This was not the case before. Afaik we've restarted before without this being an issue. [20:02:32] maybe the behaviour was different in the old library? [20:03:11] Either way, this should not be the case next time. Obviously we have no interest in re-emitting all past data points to Graphite. It's bad enough that we can't correlate timestamps. Having it replace old data every time would be worse. 
[20:03:21] replay* [20:05:59] (03PS1) 10Yuvipanda: tools: Don't touch webservice-new webservices from webservice [puppet] - 10https://gerrit.wikimedia.org/r/284032 [20:06:01] Krinkle: if you don't care about catching up from a committed offset [20:06:02] you can just set [20:06:10] (03PS2) 10Yuvipanda: tools: Don't touch webservice-new webservices from webservice [puppet] - 10https://gerrit.wikimedia.org/r/284032 [20:06:46] auto_offset_reset=-1 [20:06:54] when you instantiate the consumer [20:07:08] consumer = topic.get_simple_consumer(auto_offset_reset=-1) [20:07:25] https://github.com/wikimedia/analytics-statsv/commit/bb462f359ba1c420f3f881379c6b67ff97eeb7b3 [20:07:35] Krinkle: beginning is earliest in kafka [20:07:36] The old code used a consumer name [20:07:37] so, 1 week [20:07:37] ago [20:07:51] which presumably automatically makes the kafka server track offsets for that consumer? [20:08:12] no, i don't think it did, but, i do think that the default auto_offset_reset may have changed [20:08:15] it may have been -1 before [20:08:26] which is latests [20:08:28] latest [20:08:56] yup, looking at old code, they changed it [20:09:14] default used to be latest, it is earliest in this new pykafka version [20:10:40] Yeah, I'm okay either way with either proper continuation or just skipping and starting fresh. Both have downsides (one skews hitrate momentarily but keeps totals accurate, esp. useful for error metrics; the other keeps hitrate sane but will make total lower than it should be). [20:10:47] As long as it doesn't go by the same data twice. 
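Note on the exchange above: the trade-off between resuming from a committed offset, starting at earliest, and starting at latest can be illustrated with a small toy model. This is only a sketch; it does not import or call pykafka, and the function names, the partition offsets, and the committed offset 4200 are hypothetical values chosen for illustration. The sentinel values -2 and -1 follow the Kafka convention referred to in the chat (-1 meaning latest).

```python
# Toy model only; this does not import or call pykafka.
EARLIEST = -2  # sentinel for "oldest message still retained"
LATEST = -1    # sentinel for "only messages produced from now on"


def starting_offset(committed, oldest_retained, log_end, auto_offset_reset):
    """Pick the offset a restarted consumer would begin reading from.

    committed        -- last committed offset, or None if nothing was committed
    oldest_retained  -- offset of the oldest message still on the broker
    log_end          -- offset the next produced message will receive
    """
    if committed is not None and oldest_retained <= committed <= log_end:
        return committed  # resume where the previous run left off
    if auto_offset_reset == EARLIEST:
        return oldest_retained  # no usable offset: start at the oldest message
    return log_end  # no usable offset: skip everything already produced


def backlog_size(start, log_end):
    """Number of already-produced messages the consumer reads on startup."""
    return log_end - start


# Hypothetical partition: offsets 1000..4999 are retained, next write is 5000.
oldest, end = 1000, 5000

no_commit_earliest = starting_offset(None, oldest, end, EARLIEST)
no_commit_latest = starting_offset(None, oldest, end, LATEST)
with_commit = starting_offset(4200, oldest, end, EARLIEST)

print(backlog_size(no_commit_earliest, end))  # the whole retained backlog
print(backlog_size(no_commit_latest, end))    # nothing already produced
print(backlog_size(with_commit, end))         # only the tail not yet consumed
```

In this model, a consumer with no committed offset and an earliest reset re-reads everything the broker still holds, which matches the replay described in the chat; the latest reset avoids the replay at the cost of skipping whatever arrived while the consumer was down.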
[20:12:27] (03CR) 10Krinkle: tools: Don't touch webservice-new webservices from webservice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284032 (owner: 10Yuvipanda) [20:13:09] (03CR) 10Yuvipanda: tools: Don't touch webservice-new webservices from webservice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284032 (owner: 10Yuvipanda) [20:13:31] (03PS3) 10Yuvipanda: tools: Don't touch webservice-new webservices from webservice [puppet] - 10https://gerrit.wikimedia.org/r/284032 [20:13:48] (03PS4) 10Yuvipanda: tools: Don't touch webservice-new webservices from webservice [puppet] - 10https://gerrit.wikimedia.org/r/284032 [20:14:00] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Don't touch webservice-new webservices from webservice [puppet] - 10https://gerrit.wikimedia.org/r/284032 (owner: 10Yuvipanda) [20:14:52] ottomata: I'm not able to test this myself, but should we commit a change to statsv then? [20:17:33] 06Operations, 06Discovery, 10Salt, 03Discovery-Wikidata-Query-Service-Sprint: Failed to deploy WDQS - https://phabricator.wikimedia.org/T132952#2215452 (10Gehel) I started trying to understand with the help of https://wikitech.wikimedia.org/wiki/Trebuchet#Troubleshooting. `salt-call deploy.fetch 'wdqs/wdq... [20:19:32] ottomata: When I run a simplified version of statsv's code on stat1002 the first message I get is from "2016-04-18T20:18:44" [20:19:39] so that doesn't look like earlier. [20:20:04] https://gist.github.com/Krinkle/a9d850d912a1d5bb40a44c8c29c0827b [20:20:20] (03PS2) 10Yuvipanda: Add support for v2 manifests, which are written by webservice-new [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284026 [20:21:04] 06Operations, 10Wikimedia-Planet: Remove cache files from Planet to regenerate fr.planet.wikimedia.org page - https://phabricator.wikimedia.org/T132924#2215475 (10Dereckson) 05Open>03Resolved Harmonia has just checked, the post isn't published anymore on planet. Thanks. 
[20:21:25] ah Krinkle old version of pykafka there too [20:21:27] upgrading... [20:21:35] (03PS2) 10BBlack: update-ocsp: fix command exceptions [puppet] - 10https://gerrit.wikimedia.org/r/284001 [20:21:42] (03CR) 10BBlack: [C: 032] update-ocsp: fix command exceptions [puppet] - 10https://gerrit.wikimedia.org/r/284001 (owner: 10BBlack) [20:21:56] but Krinkle, yeah, if you update statsv code with auto_offset_reset=-1, should work. [20:23:51] (03PS2) 10BBlack: update-ocsp: fix minor pep8 issues [puppet] - 10https://gerrit.wikimedia.org/r/284002 [20:23:51] ottomata: https://wikitech.wikimedia.org/w/index.php?search="KafkaConsumer"+OR+"pykafka" [20:23:58] (03CR) 10BBlack: [C: 032 V: 032] update-ocsp: fix minor pep8 issues [puppet] - 10https://gerrit.wikimedia.org/r/284002 (owner: 10BBlack) [20:24:03] Looks like we should update snippets in that case [20:25:03] https://github.com/search?l=python&q=%22KafkaConsumer%22+OR+%22pykafka%22+%40wikimedia+-repo%3Awikimedia%2Foperations-debs-python-pykafka&ref=searchresults&type=Code&utf8=%E2%9C%93 [20:25:57] (03PS2) 10BBlack: OCSP Stapling: make icinga alerts more aggressive [puppet] - 10https://gerrit.wikimedia.org/r/283767 (https://phabricator.wikimedia.org/T132835) [20:26:21] (03CR) 10BBlack: [C: 032 V: 032] OCSP Stapling: make icinga alerts more aggressive [puppet] - 10https://gerrit.wikimedia.org/r/283767 (https://phabricator.wikimedia.org/T132835) (owner: 10BBlack) [20:26:52] ottomata: Earlier packet now on stat1002 with that code is from 2016-04-10T09:02: [20:26:54] 8 days [20:27:03] aye, makes sense [20:27:15] ottomata: What is that based on? 
[20:29:55] (03PS1) 10Faidon Liambotis: phabricator/phab_epipe: decode base64/qp body parts [puppet] - 10https://gerrit.wikimedia.org/r/284067 [20:29:57] (03PS1) 10Faidon Liambotis: phabricator/phab_epipe: use get_content_type() [puppet] - 10https://gerrit.wikimedia.org/r/284068 [20:29:59] (03PS1) 10Faidon Liambotis: phabricator/phab_epipe: use tempfile to write /tmp files [puppet] - 10https://gerrit.wikimedia.org/r/284069 [20:30:09] chasemp: ^^^ [20:30:43] ok thanks [20:31:09] the first one is the fix for the issue at hand, the other two are unrelated cleanups [20:32:50] yeah looks good, I only trying to figure out where my head was at for https://gerrit.wikimedia.org/r/#/c/284068/1/modules/phabricator/files/phab_epipe.py [20:33:16] (03CR) 10Rush: [C: 031] "makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/284067 (owner: 10Faidon Liambotis) [20:33:48] (03CR) 10Rush: [C: 031] phabricator/phab_epipe: use get_content_type() [puppet] - 10https://gerrit.wikimedia.org/r/284068 (owner: 10Faidon Liambotis) [20:34:26] (03CR) 10Rush: [C: 031] phabricator/phab_epipe: use tempfile to write /tmp files [puppet] - 10https://gerrit.wikimedia.org/r/284069 (owner: 10Faidon Liambotis) [20:34:48] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2206880 (10hashar) The `/etc/puppet/puppet.conf` file is generated by concatenating files und... 
[20:35:00] (03CR) 10Faidon Liambotis: [C: 032] phabricator/phab_epipe: decode base64/qp body parts [puppet] - 10https://gerrit.wikimedia.org/r/284067 (owner: 10Faidon Liambotis) [20:35:15] (03CR) 10Faidon Liambotis: [C: 032] phabricator/phab_epipe: use get_content_type() [puppet] - 10https://gerrit.wikimedia.org/r/284068 (owner: 10Faidon Liambotis) [20:35:23] (03CR) 10Faidon Liambotis: [C: 032] phabricator/phab_epipe: use tempfile to write /tmp files [puppet] - 10https://gerrit.wikimedia.org/r/284069 (owner: 10Faidon Liambotis) [20:35:49] chasemp: do you have a way to test all that perhaps? [20:35:54] (03PS1) 10Yuvipanda: Bump debian version number [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/284072 [20:37:15] (03CR) 10Yuvipanda: [C: 032 V: 032] Bump debian version number [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/284072 (owner: 10Yuvipanda) [20:37:44] paravoid: most of it yes and it will take effect when it lands on iridium I believe, but making sure some of the vendor emails are parsed not sure how to mimic, I'll run through standard functionality np [20:38:07] I forced-ran puppet on iridium, so it's there already [20:38:35] kk [20:39:13] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2215581 (10hashar) Random trace on deployment-cache-text04 ``` Info: Applying configuration v... [20:39:24] if we could make that accept maint-announce mails the same way we could kill RT [20:40:06] importing maint-announce into phab sounds very ugly, but it may be the pragmatic solution [20:40:46] ideally we'd have a different system for tracking those [20:41:13] currently the RT ticket just tells us to manually check it and put it on calendar [20:41:25] maybe we can just send it to a list [20:41:30] and do the same thing and be done ? 
[20:42:13] i tried to make phab accept this before but it had to be reverted [20:43:08] 06Operations, 10Wikimedia-Apache-configuration: Use gzip transfer encoding for SVG files from MediaWiki static assets - https://phabricator.wikimedia.org/T63442#2215611 (10Krinkle) [20:43:46] 06Operations, 10Traffic: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2215614 (10Krinkle) [20:44:38] roughly the way I think it could work based on the current setup is: * we have a space for maint announce things * we have an email that accepts mail to maint-announce@ and creates tasks in that space (these both work) * we accept emails to maint-announce and phab_epipe.py rewrites the sender as the bot and adds the actual sender to a header [20:45:10] afair that last bit is not done, i.e. when $vendor sends to maint-announce now we pass it through and phab says it doesn't know the sender [20:45:40] (03CR) 10Yuvipanda: [C: 032] Add support for v2 manifests, which are written by webservice-new [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284026 (owner: 10Yuvipanda) [20:45:47] (03Abandoned) 10Yuvipanda: Bump debian version [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284027 (owner: 10Yuvipanda) [20:46:18] (03Merged) 10jenkins-bot: Add support for v2 manifests, which are written by webservice-new [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284026 (owner: 10Yuvipanda) [20:46:46] it's not like we do much with maint-announces [20:47:04] and if you don't track the foreign ticket id, every single mail will be a separate task [20:47:06] i already did that sender rewriting in exim, that made phab actually take them [20:47:12] just target maint-announce into a mailing list I'd say [20:47:13] but then there was no app in phab configured to take it [20:47:34] mutante: iiuc you changed the //destination// not the source address [20:47:43] paravoid: i think so too, a list would be good enough [20:49:24] (03PS1) 10Yuvipanda: tools: Ensure the 
latest version of the webservice package [puppet] - 10https://gerrit.wikimedia.org/r/284073 [20:49:34] and yeah w/o tracking some foreign maint id it will be a mess, which I think we talked about at some point and said was acceptable but I have no preference [20:50:17] chasemp: yea, true. it was about to and envelope-to not having the .phabricator. part when the maint-announce arrived there [20:50:29] and the other things you said of course [20:50:30] (03PS2) 10Yuvipanda: tools: Ensure the latest version of the webservice package [puppet] - 10https://gerrit.wikimedia.org/r/284073 [20:50:43] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Ensure the latest version of the webservice package [puppet] - 10https://gerrit.wikimedia.org/r/284073 (owner: 10Yuvipanda) [20:51:59] 06Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2215627 (10Dzahn) 13:46 < mutante> currently the RT ticket just tells us to manually check it and put it on calendar 13:46 < mutante> maybe we can just send it to a list 13:46 < mutante> and do t... [20:53:17] 06Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2215633 (10Dzahn) I'd tend to decline this, make a ticket to create a new list and send stuff there instead. Then kill RT [20:55:52] !log furud.codfw - initial puppet install, signing certs [20:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:20] Krinkle: sorry, missed your earlies message [20:58:20] (03PS1) 10Yuvipanda: tools: Install toollabs-webservice on services hosts [puppet] - 10https://gerrit.wikimedia.org/r/284074 [20:58:24] what is what based on? [20:58:29] the week old timestamp you saw? 
[20:58:34] (03PS2) 10Yuvipanda: tools: Install toollabs-webservice on services hosts [puppet] - 10https://gerrit.wikimedia.org/r/284074 [20:58:34] !log OS install on conf200[1-3] [20:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:44] 06Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2215655 (10Dzahn) [21:00:22] it is log.retention.hours in kafka configs [21:00:37] the default is to delete data older than a week [21:01:56] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [21:02:27] (03PS8) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 [21:02:39] (03CR) 10Yuvipanda: [C: 032] tools: Install toollabs-webservice on services hosts [puppet] - 10https://gerrit.wikimedia.org/r/284074 (owner: 10Yuvipanda) [21:03:07] ottomata: I mean it being earliest to 8 days [21:03:17] Ah, okay [21:03:20] (03PS9) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 [21:03:23] I guess only some things are 30 days then [21:03:27] ok [21:04:45] 06Operations, 10Analytics-EventLogging, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2215675 (10Ottomata) Talked a bit with Timo in IRC. The new version of pykafka change the default value of `auto_offset_reset` to from latest t... [21:05:15] (03CR) 10Andrew Bogott: [C: 032] Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [21:05:30] Krinkle: ja kafka is just a log buffer, the retention there is mainly for performance reasons. if you haven't consumed a message to process within log.retention.hours, then you waited too long! 
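Note on the retention point above: the arithmetic behind "earliest is about a week ago" can be sketched as follows. The 168-hour value is the one-week default mentioned in the chat; the two timestamps are the ones quoted earlier in the log and are assumed here to be in the same time zone. The function names are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION_HOURS = 168  # one week, the default mentioned in the discussion


def retention_cutoff(now, retention_hours=RETENTION_HOURS):
    """Messages older than this instant are eligible for deletion."""
    return now - timedelta(hours=retention_hours)


def eligible_for_deletion(message_time, now, retention_hours=RETENTION_HOURS):
    """True if the message is older than the retention window."""
    return message_time < retention_cutoff(now, retention_hours)


# Timestamps quoted in the log (assumed to share a time zone).
now = datetime(2016, 4, 18, 20, 26)
first_seen = datetime(2016, 4, 10, 9, 2)

cutoff = retention_cutoff(now)
age = now - first_seen

print(cutoff)  # start of the one-week retention window
print(age)     # how old the earliest consumed message was
print(eligible_for_deletion(first_seen, now))
```

The earliest message seen was somewhat older than the one-week cutoff. One plausible explanation, not confirmed in the log, is that Kafka applies time-based retention to whole log segments, so a message can outlive the window until its entire segment has aged out.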
[21:05:33] :) [21:05:43] PROBLEM - puppet last run on furud is CRITICAL: CRITICAL: Puppet has 2 failures [21:06:32] 06Operations, 10Traffic: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2215690 (10Krinkle) [21:06:39] 06Operations, 06Mobile-Apps, 10Traffic: WikipediaApp for Android hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2215676 (10Krinkle) [21:06:42] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [21:07:50] 06Operations, 06Mobile-Apps, 10Traffic: WikipediaApp for Android hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2215676 (10Krinkle) [21:08:38] (03PS2) 10Andrew Bogott: Horizon: Add some help_text to the login fields [puppet] - 10https://gerrit.wikimedia.org/r/284025 (https://phabricator.wikimedia.org/T132694) [21:09:36] 06Operations, 10Analytics-EventLogging, 06Performance-Team, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2215696 (10Krinkle) [21:09:42] PROBLEM - gitblit process on furud is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar gitblit.jar [21:12:12] (03Abandoned) 10BBlack: Import Upstream version 1.9.12 [software/nginx] (wmf-1.9.12-1) - 10https://gerrit.wikimedia.org/r/279989 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [21:12:15] (03Abandoned) 10BBlack: multicert + libssl1.0.2 patches for 1.9.12 [software/nginx] (wmf-1.9.12-1) - 10https://gerrit.wikimedia.org/r/279990 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [21:12:19] (03Abandoned) 10BBlack: nginx (1.9.12-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.9.12-1) - 10https://gerrit.wikimedia.org/r/279991 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [21:12:33] (03CR) 10Andrew Bogott: [C: 032] Horizon: Add some help_text to the login fields [puppet] - 
10https://gerrit.wikimedia.org/r/284025 (https://phabricator.wikimedia.org/T132694) (owner: 10Andrew Bogott) [21:12:37] (03PS1) 10BBlack: multicert + libssl1.0.2 patches for 1.9.14 [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284075 (https://phabricator.wikimedia.org/T96848) [21:12:39] (03PS1) 10BBlack: pull down 3x HTTP/2 fixes from nginx master [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284076 [21:12:41] (03PS1) 10BBlack: nginx (1.9.14-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284077 (https://phabricator.wikimedia.org/T96848) [21:14:48] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2215724 (10BBlack) ^ The 1.9.14 commits have the chrome http/2 fix in place, too. I haven't built or tested these yet (pinkunicorn still on the 1.9.12 patches). [21:17:42] 06Operations, 06Mobile-Apps, 10Traffic: WikipediaApp for Android hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2215676 (10Krenair) 2014-08-*? Those are quite old versions. Did you get anything more recent than May 2015 (https://gerrit.wikimedia.org/r/#/c/208315/) ? Only ref... [21:18:31] 06Operations: Default gateway unreachable on baham.wikimedia.org after reboot - https://phabricator.wikimedia.org/T131966#2215749 (10BBlack) I don't think this is related to 4.4 or the 8.4 point release. We've been seeing it for a while on several jessie systems. Usually the worst and most-obvious ones are the... [21:23:08] (03CR) 10Alex Monk: "ping" [puppet] - 10https://gerrit.wikimedia.org/r/268921 (owner: 10Alex Monk) [21:26:13] 06Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2215784 (10RobH) The mailing list will need to have archives so we can compare and ensure items are triaged. 
As it is, items are often ignored during one week, and have to be followed up on the... [21:28:08] (03PS2) 10BBlack: Reuse update-initramfs in lvs::balancer and interface::rps [puppet] - 10https://gerrit.wikimedia.org/r/283678 (owner: 10Ema) [21:28:53] (03CR) 10BBlack: [C: 032 V: 032] Reuse update-initramfs in lvs::balancer and interface::rps [puppet] - 10https://gerrit.wikimedia.org/r/283678 (owner: 10Ema) [21:29:59] !log conf200[1-3] - signing puppet certs, salt-key, initial run [21:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:30:12] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 34 failures [21:32:13] (03PS1) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [21:33:58] !log deploying latest wdqs [21:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:37:18] (03PS1) 10Yuvipanda: Revert "Allow horizon to query the labs puppetmaster for a list of classes" [puppet] - 10https://gerrit.wikimedia.org/r/284079 [21:37:51] 06Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2215802 (10faidon) >>! In T118176#2215784, @RobH wrote: > The mailing list will need to have archives so we can compare and ensure items are triaged. As it is, items are often ignored during one... [21:38:01] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "Allow horizon to query the labs puppetmaster for a list of classes" [puppet] - 10https://gerrit.wikimedia.org/r/284079 (owner: 10Yuvipanda) [21:39:14] 06Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2215819 (10RobH) Agreed. Perhaps as the clinic person adds to the calendar, they can then also reply to that thread on the list. (We would set the list to never email back the senders to said l... 
[21:41:21] 06Operations, 06Discovery, 10Salt, 03Discovery-Wikidata-Query-Service-Sprint: Failed to deploy WDQS - https://phabricator.wikimedia.org/T132952#2215822 (10Gehel) A restart of the salt-minion fixed the issue. I'm not sure I understand why, but I'm not going to complain... [21:48:25] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2215826 (10Papaul) [21:49:04] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2184249 (10Papaul) a:05Papaul>03Ottomata Setup and installation complete [21:51:47] (03PS1) 10BBlack: Split mobile text cache for lazyloadimages testing [puppet] - 10https://gerrit.wikimedia.org/r/284080 (https://phabricator.wikimedia.org/T127883) [21:52:30] (03PS2) 10BBlack: Split mobile text cache for lazyloadimages testing [puppet] - 10https://gerrit.wikimedia.org/r/284080 (https://phabricator.wikimedia.org/T127883) [22:05:19] 06Operations, 10Traffic: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2215883 (10BBlack) Some pointers from digging around in github: transparency.wm.o definitely has bits refs in the live HTML I see in the browser: https://github.com/wikimedia/wikimedia-TransparencyReport/sea... 
[22:06:39] (03PS1) 10Dzahn: torrus: ignore lint issue with include for tests [puppet] - 10https://gerrit.wikimedia.org/r/284081 [22:06:41] (03PS1) 10Dzahn: debdeploy: rename init.pp to master.pp to match class name [puppet] - 10https://gerrit.wikimedia.org/r/284082 [22:06:43] (03PS1) 10Dzahn: interface: move rps::modparams to own file [puppet] - 10https://gerrit.wikimedia.org/r/284083 [22:06:45] (03PS1) 10Dzahn: interface: move aggregate_member to own file [puppet] - 10https://gerrit.wikimedia.org/r/284084 [22:06:47] (03PS1) 10Dzahn: (WIP) apache: ignore autoloader lint errors [puppet] - 10https://gerrit.wikimedia.org/r/284085 [22:06:52] 06Operations, 06Mobile-Apps, 10Traffic: WikipediaApp for Android hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2215884 (10Krinkle) A few more (varnish_text, bits.wikimedia.org, WikipediaApp): ``` 10 WikipediaApp/2.0 2014-08 Android 2 WikipediaApp/2.0 2014-09 And... [22:08:21] (03CR) 10jenkins-bot: [V: 04-1] interface: move rps::modparams to own file [puppet] - 10https://gerrit.wikimedia.org/r/284083 (owner: 10Dzahn) [22:08:32] (03CR) 10jenkins-bot: [V: 04-1] interface: move aggregate_member to own file [puppet] - 10https://gerrit.wikimedia.org/r/284084 (owner: 10Dzahn) [22:08:45] (03CR) 10jenkins-bot: [V: 04-1] (WIP) apache: ignore autoloader lint errors [puppet] - 10https://gerrit.wikimedia.org/r/284085 (owner: 10Dzahn) [22:10:05] (03PS1) 10Dzahn: gitblit: ensure /var/lib/gitblit exists [puppet] - 10https://gerrit.wikimedia.org/r/284086 [22:10:26] (03PS2) 10Dzahn: gitblit: ensure /var/lib/gitblit exists [puppet] - 10https://gerrit.wikimedia.org/r/284086 (https://phabricator.wikimedia.org/T123718) [22:10:52] (03CR) 10Dzahn: [C: 032] gitblit: ensure /var/lib/gitblit exists [puppet] - 10https://gerrit.wikimedia.org/r/284086 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [22:12:03] (03PS2) 10Dzahn: torrus: ignore lint issue with include for tests [puppet] - 
10https://gerrit.wikimedia.org/r/284081 [22:12:44] (03CR) 10Dzahn: [C: 032] torrus: ignore lint issue with include for tests [puppet] - 10https://gerrit.wikimedia.org/r/284081 (owner: 10Dzahn) [22:13:05] (03PS2) 10Andrew Bogott: deployment-prep shinken: deployment-salt is no longer the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/283783 (owner: 10Alex Monk) [22:13:28] (03CR) 10Yuvipanda: "I've reverted this because it re-introduced auth.conf, which seems to have killed the labs puppetmaster. See Ie7388f49493459e1259ab21cd181" [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [22:14:09] (03PS1) 10Eranroz: Show counts in category pages - hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) [22:14:40] (03CR) 10Andrew Bogott: [C: 032] deployment-prep shinken: deployment-salt is no longer the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/283783 (owner: 10Alex Monk) [22:16:53] (03CR) 10Yuvipanda: Allow horizon to query the labs puppetmaster for a list of classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [22:18:20] RECOVERY - puppet last run on furud is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:19:26] (03PS1) 10Yuvipanda: Just do a $webservice restart [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284088 [22:20:06] !log resetting labcontrol1001 puppet master with auth.conf which is fixing all the puppet clients in Labs [22:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:21:11] (03PS1) 10Dzahn: gitblit: ensure ./data/ subdir also exists [puppet] - 10https://gerrit.wikimedia.org/r/284089 [22:21:36] (03PS2) 10Dzahn: gitblit: ensure ./data/ subdir also exists [puppet] - 10https://gerrit.wikimedia.org/r/284089 [22:21:56] (03CR) 10Dzahn: [C: 032] gitblit: ensure ./data/ subdir also exists [puppet] - 
10https://gerrit.wikimedia.org/r/284089 (owner: 10Dzahn) [22:21:56] andrewbogott: around? [22:22:16] YuviPanda: I think I'm here but my irc client has been lying to me today [22:22:21] (03PS3) 10Dzahn: torrus: ignore lint issue with include for tests [puppet] - 10https://gerrit.wikimedia.org/r/284081 [22:22:35] (03PS1) 10Jcrespo: Revert "Depool pc1005 and pc2005 for cloning" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284090 [22:22:47] andrewbogott: ah, I reverted your puppetmaster patch since it broke things and left a bunch of comments on it :) [22:23:13] yeah, I saw that — think that patch would work with just the apache bits? [22:23:18] (03CR) 10Dzahn: [V: 032] gitblit: ensure ./data/ subdir also exists [puppet] - 10https://gerrit.wikimedia.org/r/284089 (owner: 10Dzahn) [22:24:05] andrewbogott: maybe - I don't know enough apache to comment :D I also only looked at the auth.conf [22:24:11] (03PS4) 10Dzahn: torrus: ignore lint issue with include for tests [puppet] - 10https://gerrit.wikimedia.org/r/284081 [22:24:26] andrewbogott: I think it should... 
[22:24:34] I'll try [22:24:48] andrewbogott: ok, thanks :D [22:25:15] (03PS1) 10Yurik: Revert "Don't yet allow wikidatasparql graph urls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 [22:25:26] (03PS2) 10Yurik: Revert "Don't yet allow wikidatasparql graph urls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 [22:27:04] (03PS3) 10Yurik: Revert "Don't yet allow wikidatasparql graph urls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) [22:30:13] (03PS1) 10Ori.livneh: Update parser cache configuration for tag-based hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284093 [22:30:17] AaronSchulz: ^ [22:30:48] L O L [22:30:49] (03PS2) 10Dzahn: debdeploy: rename init.pp to master.pp to match class name [puppet] - 10https://gerrit.wikimedia.org/r/284082 [22:31:11] (03PS3) 10Dzahn: debdeploy: rename init.pp to master.pp to match class name [puppet] - 10https://gerrit.wikimedia.org/r/284082 [22:31:54] I agreed with the patch, I just didn't understood how it was going to be backwards compatible [22:32:29] (03PS1) 10Rush: labstore: shape NFS read for dumps traffic [puppet] - 10https://gerrit.wikimedia.org/r/284094 [22:32:52] jynus: :) [22:33:01] (03PS2) 10Krinkle: rcstream: Add documentation link [puppet] - 10https://gerrit.wikimedia.org/r/271811 [22:34:14] (03CR) 10Yuvipanda: [C: 031] labstore: shape NFS read for dumps traffic [puppet] - 10https://gerrit.wikimedia.org/r/284094 (owner: 10Rush) [22:35:04] (03CR) 10Rush: [C: 032] labstore: shape NFS read for dumps traffic [puppet] - 10https://gerrit.wikimedia.org/r/284094 (owner: 10Rush) [22:36:00] (03CR) 10Aaron Schulz: [C: 031] Update parser cache configuration for tag-based hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284093 (owner: 10Ori.livneh) [22:41:37] (03PS1) 10Chad: Adding furud as antimony replacement to git replication [puppet] - 10https://gerrit.wikimedia.org/r/284097 [22:42:43] (03CR) 10jenkins-bot: 
[V: 04-1] Adding furud as antimony replacement to git replication [puppet] - 10https://gerrit.wikimedia.org/r/284097 (owner: 10Chad) [22:42:58] (03CR) 10Jcrespo: [C: 032] Revert "Depool pc1005 and pc2005 for cloning" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284090 (owner: 10Jcrespo) [22:43:26] (03PS1) 10Krinkle: noc: Use favicon from wikimedia.org instead of bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284098 (https://phabricator.wikimedia.org/T107430) [22:43:47] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#1494832 (10MaxSem) The favicon hits might be from bookmarks. I wonder if browsers will update the icon locations if we start hard-redirecting to new locations... [22:44:40] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool pc1005 (duration: 00m 26s) [22:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:45:23] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool pc2005 (duration: 00m 33s) [22:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:46:05] (03CR) 10Krinkle: [C: 032] noc: Use favicon from wikimedia.org instead of bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284098 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [22:46:49] (03PS2) 10Krinkle: noc: Use favicon from wikimedia.org instead of bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284098 (https://phabricator.wikimedia.org/T107430) [22:47:06] (03CR) 10Aaron Schulz: "<<+channel:exception +normalized_message:*DBTransactionError*>>" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (owner: 10Aaron Schulz) [22:47:34] (03PS1) 10Dzahn: gitblit: add systemd unit, if jessie use it [puppet] - 10https://gerrit.wikimedia.org/r/284100 (https://phabricator.wikimedia.org/T123718) [22:47:51] (03PS2) 10Aaron Schulz: Lowered $wgMaxUserDBWriteDuration to 5 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (https://phabricator.wikimedia.org/T95501) [22:48:13] (03CR) 10jenkins-bot: [V: 04-1] Lowered $wgMaxUserDBWriteDuration to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [22:48:46] (03PS2) 10Jcrespo: Update parser cache configuration for tag-based hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284093 (owner: 10Ori.livneh) [22:49:02] jynus: don't merge it yet, please [22:49:10] ofc [22:49:30] I was just rebasing because I made the conflict [22:49:39] if I break it, I fix it [22:49:41] (03PS2) 10Dzahn: gitblit: add systemd unit, if jessie use it [puppet] - 10https://gerrit.wikimedia.org/r/284100 (https://phabricator.wikimedia.org/T123718) [22:51:08] next rebase will be free [22:51:12] (03PS1) 10Krinkle: Remove unused $wmfHostnames['bits'] configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284101 (https://phabricator.wikimedia.org/T107430) [22:52:07] !log krinkle@tin Synchronized docroot/noc/conf: noc: remote bits references (duration: 00m 32s) [22:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:11] (03PS2) 10Chad: Adding furud as antimony replacement to git replication [puppet] - 10https://gerrit.wikimedia.org/r/284097 [22:52:20] (03CR) 10Krinkle: "Well test out in beta and canary servers first, just in case." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/284101 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [22:52:40] (03PS1) 10Jcrespo: Depool pc1006 to clone it to pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284102 [22:53:26] (03CR) 10Dzahn: [C: 032] "noop on antimony per compiler and furud is not active yet" [puppet] - 10https://gerrit.wikimedia.org/r/284100 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [22:53:53] (03PS1) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/284103 [22:55:28] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:55:47] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:57:46] PROBLEM - configured eth on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:57:57] PROBLEM - dhclient process on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:58:07] PROBLEM - salt-minion processes on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:58:08] PROBLEM - Check size of conntrack table on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:58:08] PROBLEM - DPKG on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:58:17] PROBLEM - nutcracker process on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:58:37] PROBLEM - HHVM processes on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:58:57] PROBLEM - RAID on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:59:18] PROBLEM - SSH on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:55] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2216188 (10Dzahn) existing puppet class now runs without errors on furud, systemd unit file is added when on jessie. next is... [23:00:33] (03PS2) 10Jcrespo: Depool pc1006 to clone it to pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284102 [23:01:47] PROBLEM - Disk space on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:02:36] PROBLEM - nutcracker port on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:03:03] (03CR) 10Jcrespo: [C: 032] Depool pc1006 to clone it to pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284102 (owner: 10Jcrespo) [23:04:54] can someone check console on mw1145 and reboot or whatever? [23:05:49] who is scapping? [23:05:51] (03PS2) 10Yuvipanda: Just do a $webservice restart [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284088 [23:06:03] i'm sync-filing, it is hung at 99% [23:06:18] because of mw1145 being unresponsive [23:06:25] it really needs a timeout [23:06:31] !log ori@tin Synchronized php-1.27.0-wmf.21/includes/objectcache/SqlBagOStuff.php: Ie9799f5ea: Allow tag names for SqlBagOStuff consistent hashing (duration: 04m 08s) [23:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:40] I think there is, it is just that it is very long [23:06:41] i killed the hung subprocess [23:06:50] nah, i had to kill it [23:07:18] all yours [23:07:20] how will that interact with my changes? [23:07:23] is it ok?
[23:07:36] yes, right now the behavior is not changed at all [23:07:41] ok, thanks [23:08:08] RECOVERY - nutcracker process on mw1145 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:08:16] RECOVERY - Check size of conntrack table on mw1145 is OK: OK: nf_conntrack is 0 % full [23:08:17] RECOVERY - salt-minion processes on mw1145 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:08:48] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool pc2006 (duration: 01m 01s) [23:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:21] (03PS1) 10Dzahn: gitblit: install openjdk-7-jre [puppet] - 10https://gerrit.wikimedia.org/r/284104 (https://phabricator.wikimedia.org/T123718) [23:10:23] (03PS3) 10Ori.livneh: Update parser cache configuration for tag-based hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284093 [23:10:31] jynus: sanity-check rebase? ^ [23:10:47] (03PS2) 10Dzahn: gitblit: install openjdk-7-jre [puppet] - 10https://gerrit.wikimedia.org/r/284104 (https://phabricator.wikimedia.org/T123718) [23:11:02] ok for me, ori [23:11:10] (03PS2) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/284103 [23:12:04] (03CR) 10Ori.livneh: [C: 032] Update parser cache configuration for tag-based hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284093 (owner: 10Ori.livneh) [23:12:24] (03PS3) 10Yuvipanda: Just do a $webservice restart [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284088 [23:12:30] (03Merged) 10jenkins-bot: Update parser cache configuration for tag-based hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284093 (owner: 10Ori.livneh) [23:13:41] (03CR) 10Yuvipanda: [C: 032] Just do a $webservice restart [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284088 (owner: 10Yuvipanda) [23:14:07] PROBLEM - Disk space on 
restbase1014 is CRITICAL: DISK CRITICAL - free space: /srv 177554 MB (3% inode=99%) [23:14:17] PROBLEM - nutcracker process on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:14:18] PROBLEM - Check size of conntrack table on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:14:24] urandom: ^ [23:14:27] PROBLEM - salt-minion processes on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:14:51] !log ori@tin Synchronized wmf-config: I0ec3c015f: Update parser cache configuration for tag-based hashing (duration: 01m 48s) [23:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:05] ori: jynus I'll take care of mw1145 now [23:15:11] ori: thanks [23:16:27] (03CR) 10Dzahn: [C: 032] gitblit: install openjdk-7-jre [puppet] - 10https://gerrit.wikimedia.org/r/284104 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [23:17:03] (03Merged) 10jenkins-bot: Just do a $webservice restart [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/284088 (owner: 10Yuvipanda) [23:17:15] !log rebooted mw1145 from mgmt [23:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:17] RECOVERY - Disk space on restbase1014 is OK: DISK OK [23:19:07] (urandom: also, hi! I just realized I hadn't had a chance to catch up with you in forever and that a single-character ping is not the friendliest way to check in) [23:19:23] ori: hi! 
[23:19:30] * YuviPanda watches serial console [23:19:34] (03CR) 10BBlack: [C: 031] Remove unused $wmfHostnames['bits'] configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284101 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [23:19:47] RECOVERY - dhclient process on mw1145 is OK: PROCS OK: 0 processes with command name dhclient [23:19:57] RECOVERY - nutcracker port on mw1145 is OK: TCP OK - 0.000 second response time on port 11212 [23:20:17] RECOVERY - nutcracker process on mw1145 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:20:17] RECOVERY - Check size of conntrack table on mw1145 is OK: OK: nf_conntrack is 0 % full [23:20:26] ori: jynus mw1145 is back up, are either of you going to do a scap or somesuch soon? I'm not sure if that needs to be done to make it catch up to the missed scap. [23:20:27] RECOVERY - salt-minion processes on mw1145 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:20:36] RECOVERY - Disk space on mw1145 is OK: DISK OK [23:20:37] RECOVERY - DPKG on mw1145 is OK: All packages OK [23:20:47] RECOVERY - HHVM processes on mw1145 is OK: PROCS OK: 6 processes with command name hhvm [23:20:50] YuviPanda: just run `sync-common` there [23:20:55] (03CR) 10Paladox: [C: 031] Adding furud as antimony replacement to git replication [puppet] - 10https://gerrit.wikimedia.org/r/284097 (owner: 10Chad) [23:20:56] as a normal user [23:21:06] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 73102 bytes in 7.754 second response time [23:21:06] RECOVERY - SSH on mw1145 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [23:21:06] 06Operations, 06Discovery, 10Salt, 03Discovery-Wikidata-Query-Service-Sprint: Failed to deploy WDQS - https://phabricator.wikimedia.org/T132952#2216227 (10Smalyshev) 05Open>03Resolved [23:21:18] bd808: ok. 
[23:21:22] yeah, what bd808 said [23:21:28] RECOVERY - configured eth on mw1145 is OK: OK - interfaces up [23:21:30] kk doing [23:21:36] RECOVERY - RAID on mw1145 is OK: OK: no RAID installed [23:21:58] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.045 second response time [23:22:03] !log running sync-common on mw1145 to catch up on deploys [23:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:36] (03PS3) 10Dzahn: Adding furud as antimony replacement to git replication [puppet] - 10https://gerrit.wikimedia.org/r/284097 (owner: 10Chad) [23:22:45] (03CR) 10Dzahn: [C: 032] Adding furud as antimony replacement to git replication [puppet] - 10https://gerrit.wikimedia.org/r/284097 (owner: 10Chad) [23:23:56] ok, it finished [23:24:57] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:24:59] YuviPanda: thanks! <3 [23:25:18] ori: np. 
[23:26:12] (03PS3) 10Ori.livneh: rcstream: Add documentation link [puppet] - 10https://gerrit.wikimedia.org/r/271811 (owner: 10Krinkle) [23:26:31] (03CR) 10Ori.livneh: [C: 032 V: 032] rcstream: Add documentation link [puppet] - 10https://gerrit.wikimedia.org/r/271811 (owner: 10Krinkle) [23:29:55] (03PS2) 10Ori.livneh: Remove unused $wmfHostnames['bits'] configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284101 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [23:30:01] (03CR) 10Ori.livneh: [C: 032] Remove unused $wmfHostnames['bits'] configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284101 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [23:30:36] (03Merged) 10jenkins-bot: Remove unused $wmfHostnames['bits'] configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284101 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [23:30:44] ori: haven't tested yet - if you're deploying be sure to do a full page view with js/css etc. via a canary server first. [23:31:03] yep [23:31:05] (and no new php notices in the logs :D) [23:31:07] k, thanks! [23:31:17] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [23:31:27] bbl after dinner [23:33:23] !log ori@tin Synchronized wmf-config/CommonSettings.php: I1547834: Remove unused ['bits'] configuration (duration: 00m 27s) [23:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:25] 06Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2216274 (10Dzahn) >>! In T118176#2215784, @RobH wrote: > The mailing list will need to have archives so we can compare and ensure items are triaged. As it is, items are often ignored during one...
[23:36:55] (03PS1) 10BBlack: Common VCL: remove wikimedia.org subdomain HTTPS redirect exception [puppet] - 10https://gerrit.wikimedia.org/r/284106 (https://phabricator.wikimedia.org/T102826) [23:37:10] 06Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2216278 (10RobH) It should be private archives, as we don't want the notices public. (At least, we have not in the past, as it points out where our infrastructure may be depreciated for attack.) [23:37:17] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [23:37:27] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2216279 (10MaxSem) Did some investigation on geoiplookup hits: * Most of hits have no referer, yet the same UA: `Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)` - prote... [23:37:47] (03CR) 10BBlack: "Note: this is mostly a functional no-op since we've already cleaned up DNS and cache_misc in this regard. The exception is *.planet.wm.o," [puppet] - 10https://gerrit.wikimedia.org/r/284106 (https://phabricator.wikimedia.org/T102826) (owner: 10BBlack) [23:40:13] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2216284 (10BBlack) Not sure what you mean about "protection for HTTP-only", but we do universally HTTPS-redirect all bits.wikimedia.org traffic (it's not UA-sensitive or anything). [23:41:34] (03PS1) 10Dzahn: gitblit: fix unit file, don't use /opt/ [puppet] - 10https://gerrit.wikimedia.org/r/284107 [23:43:36] (03CR) 10Dzahn: "planet should be just fine. 
it's already doing the redirect in the Apache backend with RewriteCond %{HTTP:X-Forwarded-Proto} !https" [puppet] - 10https://gerrit.wikimedia.org/r/284106 (https://phabricator.wikimedia.org/T102826) (owner: 10BBlack) [23:45:01] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2216306 (10MaxSem) I mean, IE 10 doesn't report referrer if original URL was http even if it got redirected to https. [23:45:17] (03PS2) 10Dzahn: gitblit: fix unit file, don't use /opt/ [puppet] - 10https://gerrit.wikimedia.org/r/284107 [23:46:01] (03CR) 10Dzahn: [C: 032] gitblit: fix unit file, don't use /opt/ [puppet] - 10https://gerrit.wikimedia.org/r/284107 (owner: 10Dzahn) [23:51:59] (03CR) 10Smalyshev: [C: 031] Revert "Don't yet allow wikidatasparql graph urls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: 10Yurik)