[00:18:06] PROBLEM - puppet last run on db1103 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:30:04] 06Operations: puppet mechanism updating motd is broken - https://phabricator.wikimedia.org/T80998#882211 (10faidon) More information please? In any case, if it is, please file a new task, don't revive all these old ones.
[00:31:36] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:46:06] RECOVERY - puppet last run on db1103 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[01:00:36] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[01:11:56] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:40:56] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[02:11:56] PROBLEM - nova-compute process on labvirt1013 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[02:13:56] RECOVERY - nova-compute process on labvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[02:20:13] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 14s)
[02:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu May 18 02:26:11 UTC 2017 (duration 5m 59s)
[02:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:56] (03PS3) 10Aaron Schulz: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210)
[05:01:01] !log insert decryption key for WMF Board Election
[05:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:14] 06Operations, 10ops-codfw, 10DBA: db2058: Predictive RAID failure - https://phabricator.wikimedia.org/T165498#3272571 (10Marostegui) And the disk finally failed: T165629 ``` physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Failed) ``` Let's follow up on the failed task.
[05:54:56] 06Operations, 10ops-codfw, 10DBA: db2058: Predictive RAID failure - https://phabricator.wikimedia.org/T165498#3272573 (10Marostegui) 05Open>03Invalid
[05:55:57] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3272144 (10Marostegui) p:05Triage>03Normal a:03Papaul Please @Papaul proceed and change this disk when you have time. Thanks!
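The `physicaldrive 1I:1:3 ... Failed` line quoted in T165498 is the output format of HP's RAID CLI. A minimal sketch of how such a failed disk is usually confirmed on the host itself, assuming the installed HP utility is `hpssacli` (older hosts may ship `hpacucli` instead) and that the controller sits in slot 0:

```
# Full controller configuration, including per-drive status
sudo hpssacli ctrl all show config

# Just the physical drive states on the controller in slot 0 (slot number is illustrative)
sudo hpssacli ctrl slot=0 pd all show status
```

The same drive states are what the "HP RAID" Icinga check that appears later in this log reports on.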
[05:58:46] !log Deploy alter table s2.revision table - dbstore1001
[05:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:56] (03CR) 10Marostegui: [C: 032] MariaDB: Repool db2062 after maintenanace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo)
[06:02:58] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611)
[06:04:31] (03Merged) 10jenkins-bot: MariaDB: Repool db2062 after maintenanace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo)
[06:05:14] (03PS2) 10Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611)
[06:05:43] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2062 - T116557 (duration: 00m 39s)
[06:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:51] T116557: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557
[06:05:52] (03CR) 10jenkins-bot: MariaDB: Repool db2062 after maintenanace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo)
[06:07:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[06:08:24] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[06:08:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[06:09:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 - T162611 (duration: 00m 38s)
[06:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:42] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[06:10:14] !log Deploy alter table s2.revision table - db1090 - T162611
[06:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:23] !log Deploy alter table s2.revision table - dbstore1001 - T162611
[06:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:55] (03CR) 10Marostegui: [C: 032] mariadb: clean up duplicate GRANTs for phstats user [puppet] - 10https://gerrit.wikimedia.org/r/348779 (owner: 10Dzahn)
[06:21:26] !log Deploy alter table s2.revision table - labsdb1001 - T162611
[06:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:34] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[06:24:27] !log Deploy alter table on s2.ptwiki directly on codfw master (db2017) after running the clean up duplicates script - https://phabricator.wikimedia.org/T164530
[06:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:44] !log installing tiff security updates
[06:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:54] (03CR) 10Jcrespo: [C: 031] "OK, but not sure if it will work (e.g. if some text has been already sent)." [software/dbtree] - 10https://gerrit.wikimedia.org/r/353388 (https://phabricator.wikimedia.org/T163143) (owner: 10Dzahn)
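Most of the day's DBA work follows one pattern: push a one-line change to wmf-config/db-eqiad.php (or db-codfw.php), merge it, sync it from tin so MediaWiki stops sending read traffic to the replica about to be altered, then revert to repool. As a rough sketch of what the depool edit itself looks like, assuming the usual sectionLoads layout of that file; the master placeholder and the weights are illustrative, not copied from the actual change:

```
// wmf-config/db-eqiad.php (sketch): take db1090 out of the s2 rotation
'sectionLoads' => [
    's2' => [
        'dbXXXX' => 0,      // current s2 master, weight 0 (placeholder name)
        'db1074' => 100,    // weights are illustrative
        // 'db1090' => 100, // depooled for the s2.revision alter (T162611)
    ],
    // ... other sections ...
],
```

The `!log marostegui@tin Synchronized wmf-config/db-eqiad.php: ...` entries are what the sync step from tin logs when it ships the merged change to the application servers.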
[07:00:34] (03PS1) 10Ema: Bump version number in setup.py [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/354180
[07:01:41] !log Deploy alter table on s2.plwiki directly on codfw master (db2017) after running the clean up duplicates script - https://phabricator.wikimedia.org/T164530
[07:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:39] (03CR) 10Marostegui: "I merged this. Thanks for picking this up Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/348779 (owner: 10Dzahn)
[07:14:32] 06Operations, 10Traffic: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3272649 (10ema)
[07:14:39] 06Operations, 10Traffic: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3272664 (10ema) p:05Triage>03Normal
[07:23:26] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:31] PROBLEM - mysqld processes on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:36] PROBLEM - MariaDB Slave SQL: s6 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:36] PROBLEM - Disk space on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave IO: s4 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave IO: s1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave SQL: s7 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave IO: s3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - configured eth on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave IO: m2 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:47] PROBLEM - MariaDB Slave IO: s2 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:49] PROBLEM - MariaDB Slave IO: m3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB disk space on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - dhclient process on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB Slave Lag: x1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB Slave IO: s6 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB Slave SQL: m3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - DPKG on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:56] PROBLEM - MariaDB Slave SQL: x1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:56] PROBLEM - MariaDB Slave IO: s5 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:56] PROBLEM - MariaDB Slave IO: s7 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:56] PROBLEM - Check size of conntrack table on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:01] :|
[07:24:08] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:08] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:16] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:16] PROBLEM - Check systemd state on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:16] PROBLEM - MariaDB Slave SQL: s5 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:16] PROBLEM - MariaDB Slave Lag: m3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:21] why is it complaining?
[07:24:22] I guess dbstore2001 died ?
[07:24:27] it didn't
[07:24:27] nope
[07:24:28] it is up
[07:25:14] ok then, it's icinga .. I am looking
[07:25:22] icinga checks' duration is 6d 23h, maybe some expired downtime?
[07:26:55] but it is complaining about everything
[07:26:56] PROBLEM - Check the NTP synchronisation status of timesyncd on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:26:57] connect to address 10.192.0.32 port 5666: Connection refused
[07:27:00] disk space, puppet, etc
[07:27:06] PROBLEM - HP RAID on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:27:13] npr issue?
[07:27:16] RECOVERY - MariaDB Slave Lag: m3 on dbstore2001 is OK: OK slave_sql_lag not a slave
[07:27:22] npre?
[07:27:25] !log restart nagios-nrpe-server on dbstore2001
[07:27:26] RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86638.68 seconds
[07:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:32] RECOVERY - mysqld processes on dbstore2001 is OK: PROCS OK: 1 process with command name mysqld
[07:27:34] now.. why did that happen ?
[07:27:36] RECOVERY - Disk space on dbstore2001 is OK: DISK OK
[07:27:36] RECOVERY - MariaDB Slave SQL: s6 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:27:46] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86684.35 seconds
[07:27:46] RECOVERY - MariaDB Slave IO: s4 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:46] RECOVERY - MariaDB Slave IO: s1 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:46] RECOVERY - MariaDB Slave SQL: s7 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:27:46] RECOVERY - MariaDB Slave IO: s3 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:46] RECOVERY - configured eth on dbstore2001 is OK: OK - interfaces up
[07:27:46] RECOVERY - MariaDB Slave IO: m2 on dbstore2001 is OK: OK slave_io_state not a slave
[07:27:47] RECOVERY - MariaDB Slave IO: s2 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:52] RECOVERY - MariaDB disk space on dbstore2001 is OK: DISK OK
[07:27:52] RECOVERY - MariaDB Slave IO: m3 on dbstore2001 is OK: OK slave_io_state not a slave
[07:27:52] RECOVERY - dhclient process on dbstore2001 is OK: PROCS OK: 0 processes with command name dhclient
[07:27:56] RECOVERY - DPKG on dbstore2001 is OK: All packages OK
[07:27:56] RECOVERY - MariaDB Slave SQL: x1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:27:56] RECOVERY - MariaDB Slave SQL: m3 on dbstore2001 is OK: OK slave_sql_state not a slave
[07:27:56] RECOVERY - MariaDB Slave SQL: s1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:27:56] RECOVERY - Check size of conntrack table on dbstore2001 is OK: OK: nf_conntrack is 0 % full
[07:27:56] RECOVERY - MariaDB Slave IO: s7 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:56] RECOVERY - MariaDB Slave IO: s5 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:57] RECOVERY - MariaDB Slave Lag: m2 on dbstore2001 is OK: OK slave_sql_lag not a slave
[07:28:06] RECOVERY - MariaDB Slave Lag: x1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 77036.97 seconds
[07:28:06] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86521.97 seconds
[07:28:06] RECOVERY - MariaDB Slave IO: x1 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:28:06] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures
[07:28:07] RECOVERY - MariaDB Slave SQL: m2 on dbstore2001 is OK: OK slave_sql_state not a slave
[07:28:07] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86512.68 seconds
[07:28:07] RECOVERY - MariaDB Slave IO: s6 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:28:07] RECOVERY - salt-minion processes on dbstore2001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:28:07] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:28:08] RECOVERY - Check whether ferm is active by checking the default input chain on dbstore2001 is OK: OK ferm input default policy is set
[07:28:08] RECOVERY - MariaDB Slave SQL: s2 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:28:09] RECOVERY - MariaDB Slave SQL: s4 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:28:34] marostegui, 6 days ago was when that server had problems, could that have caused a problem on monitoring or something?
[07:28:48] jynus: Don't think so, because it was only MySQL
[07:28:50] it is a question, I do not know when that had problems
[07:28:56] The rest was perfectly fine
[07:29:09] May 18 07:27:12 dbstore2001 systemd[1]: Stopping LSB: Start/Stop the Nagios remote plugin execution daemon...
[07:29:09] May 11 07:47:32 dbstore2001 sudo[42906]: pam_unix(sudo:session): session closed for user root
[07:29:09] May 11 07:47:32 dbstore2001 sudo[42906]: pam_unix(sudo:session): session opened for user root by (uid=0)
[07:29:09] May 11 07:47:32 dbstore2001 sudo[42906]: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib/nagios/plugins/check_ferm
[07:29:16] that's in reverse chronological order
[07:29:17] or maybe the script got confused?
[07:29:24] why are 7 days missing from nagios-nrpe-server logs
[07:29:25] ?
[07:29:43] or overloaded and made monitoring fail?
[07:30:01] I would restart that server
[07:30:11] maybe there are host problems
[07:30:29] akosiaris: that's also the icinga checks duration (7d)
[07:30:44] ema: ?
[07:30:50] During the problems last week, the server had the disk utilization 100% for a few hours, that was all (during mysql start)
[07:30:55] I don't follow
[07:31:03] https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=dbstore2001&var-network=eth0&from=now-7d&to=now
[07:31:24] maybe too early for me, but what is the icinga checks duration ?
[07:31:40] akosiaris: in icinga the checks were marked as failing since 7d
[07:31:44] eg: dbstore2001 Check the NTP synchronisation status of timesyncd CRITICAL 2017-05-18 07:26:47 6d 23h 34m 50s
[07:31:45] ah
[07:32:14] That does match more or less the server issues
[07:32:17] server = mysql issues
[07:32:32] so after mysql got overloaded, maybe monitoring got in a bad state
[07:32:32] ok so nagios-nrpe-server was most probably not running for the last 7 days
[07:32:37] right
[07:32:41] something like that
[07:32:49] was the server in scheduled downtime or something ?
[07:32:54] and maybe manuel had acked the problems due to the mysql problems?
[07:33:00] akosiaris: I probably downtimed it while debugging the issues
[07:33:03] acked/downtime/etc
[07:33:06] which is ok
[07:33:17] if it happens on friday and it is not a critical host like this
[07:33:29] But how can MySQL overload break monitoring? Because if you guys check the graph I posted earlier, the server wasn't really overloaded
[07:33:30] "just don't page me for a week"
[07:33:37] in theory no
[07:33:41] but who knows
[07:33:44] it has so many checks
[07:33:49] about replication
[07:33:55] that maybe it got overloaded too
[07:34:25] Yes, could be
[07:34:27] it will probably have >21 replication checks
[07:34:36] Some sort of cascade effect or something
[07:34:59] which all break because the timeout on the check is not handled appropriately (that is what I am fixing in the new check script)
[07:35:17] or
[07:35:22] alternatively
[07:35:35] nrpe could have caused the mysql issues in the first place
[07:36:08] that is an interesting thought indeed
[07:36:11] (indirectly, because the check does not respond well to the timeout)
[07:36:36] so there are 2 actionables here- fix the check script, which I was going to do anyway
[07:36:43] !log installing freetype security updates on trusty (jessie already fixed)
[07:36:48] and investigate why nrpe can die
[07:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:16] RECOVERY - HP RAID on dbstore2001 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor
[07:37:31] it is not only the mysql check
[07:37:55] the HP RAID is also prone to timeout^
[07:38:04] Normally I have seen graphs not graphing when the server is super overloaded, and in this case, there are no graph gaps
[07:38:08] so we have a perfect storm there
[07:38:26] no, mysql was working well, at least in the last 7 days
[07:38:42] No, I mean all graphs stopping to draw, disk space, load, everything
[07:39:01] yes, I am agreeing with you :-)
[07:39:19] But if we check the last 7 day load graph, we can see that when the mysql issues were happening, the load was nothing too outstanding
[07:39:23] very strange
[07:39:54] the checks get stuck, they usually won't create more processes
[07:40:40] we need to get rid of the blocking checks both on nagios and prometheus
[07:40:43] there's nothing in the logs about how nagios-nrpe-server died
[07:40:51] I can guess why it was not restarted by puppet
[07:41:01] why?
[07:41:07] the init script relies on the pid file
[07:41:08] not fully killed?
[07:41:13] and the pid file was still around
[07:41:18] it's a crappy initscript put simply
[07:41:36] but the pid is there precisely to read it! :-)
[07:41:56] status_of_proc -p $PIDDIR/nrpe.pid "$DAEMON" "$NAME" && exit 0 || exit $?
[07:42:00] that's what it does
[07:42:10] I would expect that to work fine
[07:43:00] yes, it should
[07:43:12] so maybe the process was somehow alive and dead?
[07:43:21] zombie ?
[07:43:29] no it would have shown in my ps
[07:43:35] there was nothing running
[07:43:40] not unix-zombie
[07:44:23] maybe the replacement process was overwritten but could not lock its state or something
[07:44:31] I do not know
[07:44:39] this is one of a kind
[07:44:40] I just killed it btw just to test
[07:45:01] I am trying to check puppet logs to see if it was trying to bring it up or complaining or something, and no, nothing
[07:45:03] why are we still using init for nagios, is it what jessie has?
[07:45:12] yes
[07:45:15] ok
[07:45:25] ah there we go
[07:45:30] to be fair, the checks were failing
[07:45:35] so I've just done a kill
[07:45:37] so it is not like we hadn't noticed
[07:45:49] I do not think there is much actionable there
[07:45:55] and service nagios-nrpe-server status reports everything ok
[07:46:01] nice
[07:46:10] so it is only that check
[07:46:11] * nagios-nrpe-server.service - LSB: Start/Stop the Nagios remote plugin execution daemon
[07:46:11] Loaded: loaded (/etc/init.d/nagios-nrpe-server)
[07:46:11] Active: active (exited) since Thu 2017-05-18 07:43:06 UTC; 2min 13s ago
[07:46:11] Process: 31235 ExecStop=/etc/init.d/nagios-nrpe-server stop (code=exited, status=0/SUCCESS)
[07:46:12] Process: 31310 ExecStart=/etc/init.d/nagios-nrpe-server start (code=exited, status=0/SUCCESS)
[07:46:16] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3739.30 Read Requests/Sec=1828.80 Write Requests/Sec=1.60 KBytes Read/Sec=36236.00 KBytes_Written/Sec=445.20
[07:46:19] so... lemme check what puppet now does
[07:46:32] I am really really surprised this has not bitten us worse
[07:46:53] yup puppet DID not start nagios-nrpe-server on a test run
[07:47:08] so my guess is that somehow nagios-nrpe-server stopped running
[07:47:26] sorry, I do not fully understand, but maybe you want to write a ticket about that, and of course I can help?
[07:47:50] I'd prefer to spend the time to ship a systemd unit and fix the problem for good ;-)
[07:48:02] I agree :-)
[07:48:35] and as I said, checks were failing, so it was not going as badly
[07:48:49] the stretch package already has a systemd unit
[07:49:31] moritzm: with a sane Restart= value? :)
[07:49:57] moritzm: nice.. I guess I'll just steal it and ship that :-)
[07:50:51] ema: Restart=on-abort, yep
[07:51:47] oh, it paged, sorry
[07:52:16] icinga normally doesn't fail
[07:53:28] I like how we have systemd unit checks being done by nrpe
[07:53:34] but nrpe itself is not run via systemd
[07:53:48] can't help but laugh
[07:54:24] :-)
[07:54:34] it cannot check itself
[07:54:46] ah that would be a nice check
[07:55:00] :D
[07:56:16] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.30 Read Requests/Sec=3.40 Write Requests/Sec=0.40 KBytes Read/Sec=15.58 KBytes_Written/Sec=4.00
[07:56:56] RECOVERY - Check the NTP synchronisation status of timesyncd on dbstore2001 is OK: OK: synced at Thu 2017-05-18 07:56:47 UTC.
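The fix akosiaris uploads shortly afterwards (https://gerrit.wikimedia.org/r/354183, "nrpe: Ship a systemd unit file") replaces the LSB script whose status line is quoted above with a native unit. A minimal sketch of what such a unit looks like, modeled on the Debian stretch packaging moritzm mentions; the unit actually merged into puppet may differ in paths and options:

```
# /lib/systemd/system/nagios-nrpe-server.service (sketch)
[Unit]
Description=Nagios Remote Plugin Executor
After=network.target

[Service]
# Run nrpe in the foreground so systemd supervises the real process
# instead of trusting a pid file left behind by an init script.
ExecStart=/usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f
# The Restart= value confirmed above: restart only on abnormal exit.
Restart=on-abort

[Install]
WantedBy=multi-user.target
```

With a unit like this, a crashed nrpe comes back on its own instead of sitting dead for a week behind a wall of "Return code of 255" alerts.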
[08:01:58] !log reboot rhenium for update to Linux 4.9
[08:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:20] (03PS1) 10Giuseppe Lavagetto: restbase: convert deployment-prep to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/354182
[08:09:26] !log Deploy alter table on s1.enwiki directly on codfw master (db2016) after running the clean up duplicates script - https://phabricator.wikimedia.org/T164530
[08:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:42] (03CR) 10jerkins-bot: [V: 04-1] restbase: convert deployment-prep to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/354182 (owner: 10Giuseppe Lavagetto)
[08:09:53] (03PS1) 10Alexandros Kosiaris: nrpe: Ship a systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/354183
[08:10:09] <_joe_> meh
[08:12:24] (03PS2) 10Giuseppe Lavagetto: restbase: convert deployment-prep to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/354182
[08:15:46] (03CR) 10Giuseppe Lavagetto: [C: 032] "noop in production" [puppet] - 10https://gerrit.wikimedia.org/r/354182 (owner: 10Giuseppe Lavagetto)
[08:16:51] !log reboot dataset1001 for kernel update
[08:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:01] (03PS1) 10Elukey: Fix MediaWiki centralauth errors graphite alarm [puppet] - 10https://gerrit.wikimedia.org/r/354184
[08:18:41] (03CR) 10Elukey: [C: 032] Fix MediaWiki centralauth errors graphite alarm [puppet] - 10https://gerrit.wikimedia.org/r/354184 (owner: 10Elukey)
[08:20:10] _joe_ you can merge --^ whenever you are ready
[08:20:32] (03CR) 10Muehlenhoff: [C: 031] "One nit, but LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354183 (owner: 10Alexandros Kosiaris)
[08:21:36] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/mnt/data]
[08:22:05] <_joe_> elukey: oh yeah gimme a min sorry
[08:22:20] <_joe_> also thanks, I planned on fixing that
[08:22:48] I left the other two because the syntax seems fine but no datapoints :(
[08:24:08] <_joe_> it wasn't "no datapoints"
[08:24:16] <_joe_> it's "not enough datapoints"
[08:24:18] <_joe_> and I had a fix
[08:24:30] <_joe_> laters :P
[08:24:41] <_joe_> deployment-prep is sucking my soul out of me
[08:25:20] ok nice (for the metrics, not the soul :)
[08:25:26] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:25:32] fixed --^
[08:31:20] _joe_ the aqs refactoring seems really good, thanks a lot for doing it
[08:32:09] !log upgrading mw1180-mw1188, mw1200-mw1208 to new hhvm-luasandbox/hhvm-luasandbox-dbg packages
[08:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:45] (03PS1) 10Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186
[08:35:47] _joe_: <3
[08:36:12] (03PS2) 10Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148)
[08:37:16] (03PS1) 10Giuseppe Lavagetto: profile::cassandra: remove useless pick() [puppet] - 10https://gerrit.wikimedia.org/r/354187
[08:38:11] <_joe_> greg-g: it's totally not your team's fault, I think we do have a path to a solution I outlined in T161675
[08:38:11] T161675: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675
[08:38:49] (03CR) 10Alexandros Kosiaris: "Hm so it looks like out of 215 system::role stanzas, 154 use the role:: prefix and 61 don't. It's clearly not consistent and a sign we wil" [puppet] - 10https://gerrit.wikimedia.org/r/354172 (owner: 10Dzahn)
[08:38:55] (03CR) 10Greg Grossmeier: [C: 031] Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani)
[08:39:17] (03CR) 10Greg Grossmeier: [C: 031] "That was Chad btw ^" [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani)
[08:39:58] _joe_: yeah, we were just talking about that. And the <3 is true love, we feel your pain :)
[08:40:18] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::cassandra: remove useless pick() [puppet] - 10https://gerrit.wikimedia.org/r/354187 (owner: 10Giuseppe Lavagetto)
[08:42:04] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354188 (https://phabricator.wikimedia.org/T159753)
[08:43:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354188 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[08:45:11] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354188 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[08:45:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354188 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[08:46:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T159753 T164530 (duration: 00m 39s)
[08:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:43] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[08:46:44] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[08:48:53] (03CR) 10Elukey: [C: 04-1] "Thanks a lot, looks great! Just a couple of questions/notes for Eric before proceeding:" [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto)
[08:49:12] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354189
[08:50:25] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1090 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354189 (owner: 10Marostegui)
[08:51:26] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354189 (owner: 10Marostegui)
[08:51:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354189 (owner: 10Marostegui)
[08:52:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T159753 T164530 (duration: 00m 39s)
[08:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:26] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[08:52:26] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[08:57:00] (03PS1) 10Muehlenhoff: Also strip rpcbind/nfs-common deps on jessie installs [puppet] - 10https://gerrit.wikimedia.org/r/354190 (https://phabricator.wikimedia.org/T106477)
[09:01:08] (03PS1) 10Marostegui: db-eqiad.php: Repool db1090, depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354191 (https://phabricator.wikimedia.org/T162611)
[09:02:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1090, depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354191 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[09:03:49] (03PS1) 10Jcrespo: mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192
[09:03:51] (03CR) 10Alexandros Kosiaris: nrpe: Ship a systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354183 (owner: 10Alexandros Kosiaris)
[09:03:58] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1090, depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354191 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[09:05:43] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1090, depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354191 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[09:06:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1090, depool db1076 - T162611 (duration: 00m 39s)
[09:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:18] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[09:06:52] !log Deploy alter table s2.revision table - db1076 - https://phabricator.wikimedia.org/T162611
[09:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:11] !log upgrading image scalers mw1294/mw1295 to Linux 4.9 and HHVM 3.18
[09:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:58] (03PS2) 10Alexandros Kosiaris: nrpe: Ship a systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/354183
[09:08:49] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354195 (https://phabricator.wikimedia.org/T159753)
[09:10:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354195 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:11:17] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354195 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:11:28] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354195 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:14:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1083 - T159753 T164530 (duration: 00m 39s)
[09:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:45] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[09:14:45] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[09:15:32] (03PS1) 10Giuseppe Lavagetto: deployment-prep: additional fixes to restbase hiera [puppet] - 10https://gerrit.wikimedia.org/r/354196
[09:16:17] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment-prep: additional fixes to restbase hiera [puppet] - 10https://gerrit.wikimedia.org/r/354196 (owner: 10Giuseppe Lavagetto)
[09:16:20] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] deployment-prep: additional fixes to restbase hiera [puppet] - 10https://gerrit.wikimedia.org/r/354196 (owner: 10Giuseppe Lavagetto)
[09:16:43] (03PS2) 10Jcrespo: mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192
[09:17:58] (03CR) 10Marostegui: [C: 031] mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192 (owner: 10Jcrespo)
[09:21:04] 06Operations, 07Zuul: Add a stretch debian package for zuul - https://phabricator.wikimedia.org/T165621#3272870 (10hashar) 05Open>03declined This is premature. Will come to it when it is time :-}
[09:23:52] (03PS1) 10Marostegui: db-eqiad.php: Repool db1083, depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354200 (https://phabricator.wikimedia.org/T159753)
[09:25:20] (03PS1) 10Giuseppe Lavagetto: restbase: remove legacy classes, roles [puppet] - 10https://gerrit.wikimedia.org/r/354201
[09:25:22] (03PS1) 10Giuseppe Lavagetto: cassandra: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/354202
[09:28:25] (03CR) 10Giuseppe Lavagetto: [C: 032] restbase: remove legacy classes, roles [puppet] - 10https://gerrit.wikimedia.org/r/354201 (owner: 10Giuseppe Lavagetto)
[09:28:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1083, depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354200 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:28:41] (03CR) 10Giuseppe Lavagetto: [C: 032] cassandra: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/354202 (owner: 10Giuseppe Lavagetto)
[09:31:08] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1083, depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354200 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:31:20] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1083, depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354200 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:33:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1083, depool db1080 - T159753 T164530 (duration: 00m 38s)
[09:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:36] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[09:33:37] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[09:37:32] (03PS3) 10Jcrespo: mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192
[09:43:33] (03PS1) 10Marostegui: db-eqiad.php: Repool db1080, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354203 (https://phabricator.wikimedia.org/T159753)
[09:45:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1080, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354203 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:45:57] (03CR) 10Jcrespo: [C: 032] mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192 (owner: 10Jcrespo)
[09:46:14] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1080, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354203 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:46:23] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1080, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354203 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:47:07] !log upgrading image scalers in codfw to Linux 4.9 and HHVM 3.18
[09:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1080, depool db1073 - T159753 T164530 (duration: 00m 39s)
[09:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:09] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[09:49:09] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[09:59:08] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[10:00:08] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 74668 bytes in 0.236 second response time
[10:07:51] (03PS7) 10Mforns: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[10:08:35] (03CR) 10jerkins-bot: [V: 04-1] [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[10:08:44] 06Operations, 10Gerrit: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631#3272935 (10demon) We can't move them behind LVS. Unlike Phabricator, which uses a separate hostname for the SSH service, Gerrit exposes them over the same domain. Last time we...
[10:10:49] (03PS1) 10Jcrespo: [WIP]Initial commit of existent python scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/354206
[10:21:33] (03PS8) 10Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933)
[10:22:24] (03CR) 10jerkins-bot: [V: 04-1] [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[10:26:36] (03PS9) 10Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933)
[10:37:28] (03PS1) 10Alexandros Kosiaris: Document servermon optimization [puppet] - 10https://gerrit.wikimedia.org/r/354207 (https://phabricator.wikimedia.org/T164604)
[10:54:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] New upstream version 1.8.3 [calico-cni] - 10https://gerrit.wikimedia.org/r/353867 (owner: 10Giuseppe Lavagetto)
[10:55:55] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#3273006 (10Ottomata)
[10:56:18] PROBLEM - swift-object-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-container-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-object-updater on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-account-server on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-container-auditor on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-account-reaper on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:28] PROBLEM - swift-account-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:28] PROBLEM - swift-account-auditor on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:28] PROBLEM - salt-minion processes on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
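This burst on ms-be1019, like the dbstore2001 flood earlier in the morning, is every NRPE-backed check on one host failing at once rather than the monitored services themselves breaking. A quick way to tell the two cases apart from the Icinga side is to query the NRPE daemon directly; a sketch, assuming the stock Debian plugin path, with `check_swift_object_replicator` standing in for whatever remote command name is actually configured:

```
# With no -c argument check_nrpe just asks the daemon for its version.
# "Connection refused" or a timeout here means the daemon itself is the
# problem and every check on the host will alert, as seen above.
/usr/lib/nagios/plugins/check_nrpe -H ms-be1019.eqiad.wmnet

# Run one specific remote command once the daemon answers
/usr/lib/nagios/plugins/check_nrpe -H ms-be1019.eqiad.wmnet -c check_swift_object_replicator
echo $?   # 0/1/2 = OK/WARNING/CRITICAL from the remote plugin, 3 = UNKNOWN
```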
[10:56:30] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Updating debian version [calico-cni] - 10https://gerrit.wikimedia.org/r/353868 (owner: 10Giuseppe Lavagetto)
[10:57:08] RECOVERY - swift-object-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[10:57:08] RECOVERY - swift-object-updater on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[10:57:09] RECOVERY - swift-account-server on ms-be1019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[10:57:09] RECOVERY - swift-container-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[10:57:09] RECOVERY - swift-container-auditor on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:57:09] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] package name change [calico-cni] - 10https://gerrit.wikimedia.org/r/353869 (owner: 10Giuseppe Lavagetto)
[10:57:18] RECOVERY - swift-account-reaper on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[10:57:18] RECOVERY - swift-account-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[10:57:18] RECOVERY - swift-account-auditor on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[10:57:18] RECOVERY - salt-minion processes on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:01:42] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3273015 (10Ottomata)
[11:10:26] !log Run pt-table-checksum on s7.metawiki - T163190
[11:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:35] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190
[11:11:59] (03PS3) 10Hashar: interface: add rspec boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/340420
[11:13:12] (03PS10) 10Mforns: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[11:19:27] (03PS5) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840
[11:20:12] (03PS1) 10Marostegui: db-eqiad.php: Repool db1073, depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354208 (https://phabricator.wikimedia.org/T159753)
[11:20:26] (03PS1) 10Alexandros Kosiaris: Update docker-host.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/354209
[11:20:28] (03CR) 10jerkins-bot: [V: 04-1] interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar)
[11:20:31] (03CR) 10Hashar: "Rebased. Interestingly the add_ip6_mapped no more use the $::interfaces but $facts['interfaces'] so I had to slightly update the pre cond" [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar)
[11:20:40] 06Operations, 10Salt, 06Services, 10Trebuchet: `git deploy service restart` asked for sudo password - https://phabricator.wikimedia.org/T126359#3273037 (10demon) 05Open>03declined Nobody cares about Trebuchet anymore.
[11:20:50] 10Blocked-on-Operations, 06Operations, 10Parsoid, 10Salt: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#3273060 (10demon)
[11:20:56] 06Operations, 10Salt, 10Trebuchet, 13Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#3273058 (10demon) 05Open>03declined Nobody cares about Trebuchet anymore.
[11:21:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1073, depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354208 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[11:22:53] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1073, depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354208 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[11:23:01] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1073, depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354208 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[11:23:17] (03PS2) 10Alexandros Kosiaris: Document servermon optimization [puppet] - 10https://gerrit.wikimedia.org/r/354207 (https://phabricator.wikimedia.org/T164604)
[11:23:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Document servermon optimization [puppet] - 10https://gerrit.wikimedia.org/r/354207 (https://phabricator.wikimedia.org/T164604) (owner: 10Alexandros Kosiaris)
[11:23:36] (03PS2) 10Alexandros Kosiaris: Update docker-host.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/354209
[11:23:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update docker-host.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/354209 (owner: 10Alexandros Kosiaris)
[11:23:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073, depool db1072 - T159753 T164530 (duration: 00m 39s)
[11:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:03] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[11:24:03] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[11:24:33] (03PS6) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840
[11:30:05] (03CR) 10Hashar: [C: 031] "I have cherry picked the patch on deployment-prep again :-}" [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar)
[11:34:16] (03PS1) 10Alexandros Kosiaris: Remove non-ascii character from servermon.rb [puppet] - 10https://gerrit.wikimedia.org/r/354212
[11:34:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove non-ascii character from servermon.rb [puppet] - 10https://gerrit.wikimedia.org/r/354212 (owner: 10Alexandros Kosiaris)
[11:37:24] (03PS1) 10Giuseppe Lavagetto: Workaround: use locally installed glide binary [calico-cni] - 10https://gerrit.wikimedia.org/r/354213
[11:38:55] (03CR) 10Giuseppe Lavagetto: [C: 032] "I will find a better solution later when we use stretch as well." [calico-cni] - 10https://gerrit.wikimedia.org/r/354213 (owner: 10Giuseppe Lavagetto)
[11:38:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Workaround: use locally installed glide binary [calico-cni] - 10https://gerrit.wikimedia.org/r/354213 (owner: 10Giuseppe Lavagetto)
[11:53:41] (03CR) 10ArielGlenn: [C: 031] "Better than nothing; at least it will cover some failure cases." [software/dbtree] - 10https://gerrit.wikimedia.org/r/353388 (https://phabricator.wikimedia.org/T163143) (owner: 10Dzahn)
[12:06:37] 06Operations, 10ops-eqiad: rack and setup 24 parsoid servers - https://phabricator.wikimedia.org/T165520#3273162 (10Cmjohnson) Servers are racked 3 per rack and 6 per row.
[12:31:31] 06Operations, 07HHVM, 07Upstream: HHVM: Crash in server worker - https://phabricator.wikimedia.org/T165669#3273206 (10MoritzMuehlenhoff)
[12:40:42] (03PS1) 10Marostegui: db-eqiad.php: Repool db1076, depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354218 (https://phabricator.wikimedia.org/T159753)
[12:42:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1076, depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354218 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:42:51] !log upgrading mw1161 (job runner) to HHVM 3.18
[12:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:16] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1076, depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354218 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:43:28] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1076, depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354218 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:44:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1076, depool db1074 - T159753 T164530 (duration: 00m 39s)
[12:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:24] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[12:44:24] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[12:44:29] !log Deploy alter table s2.revision table - db1074 - https://phabricator.wikimedia.org/T162611
[12:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:08] (03PS1) 10Marostegui: db-eqiad.php: Repool db1072, depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354220 (https://phabricator.wikimedia.org/T159753)
[12:48:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1072, depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354220 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:49:22] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1072, depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354220 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:49:31] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1072, depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354220 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:50:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072, depool db1066 - T159753 T164530 (duration: 00m 38s)
[12:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:29] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[12:50:30] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[12:51:27] !log upgrading mw1209-mw1219 to Linux 4.9 and HHVM 3.18
[12:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:42] (03PS1) 10Marostegui: db-eqiad.php: Repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354221 (https://phabricator.wikimedia.org/T159753)
[12:55:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354221 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:56:40] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354221 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:56:49] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354221 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:57:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 - T159753 T164530 (duration: 00m 38s)
[12:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:44] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[12:57:44] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T1300).
[13:00:25] 07Puppet, 10DBA, 10Monitoring, 07Documentation, 13Patch-For-Review: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3273290 (10akosiaris) The patch above does document everything in the servermon.rb reporter (which is the applicatio...
[13:03:36] 07Puppet, 10DBA, 10Monitoring, 07Documentation, 13Patch-For-Review: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3273297 (10jcrespo) Let's close this as the scope is for me done, and let's open a new ticket with lower priority wi...
[13:14:08] 06Operations, 10DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3273365 (10akosiaris) p:05Triage>03Lowest
[13:14:19] !log reloaded kafkatee to test T151748
[13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:28] T151748: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748
[13:14:56] !log AMEND prev: reloaded kafkatee on oxygen
[13:15:00] 07Puppet, 10DBA, 10Monitoring, 07Documentation, 13Patch-For-Review: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3239514 (10akosiaris) 05Open>03Resolved a:03akosiaris Agreed. Relevant stuff copied over to T165674 (marked lo...
[13:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:48] 06Operations, 10DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3273350 (10akosiaris) Note that this is occuring seldomly and not causing any issues whatsoever. It's mostly out of personal interest that we are looking into this, hence the very low priority.
[13:16:51] (03PS3) 10Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136)
[13:20:14] (03PS1) 10Muehlenhoff: Add ferm service for rpc.statd on labstore [puppet] - 10https://gerrit.wikimedia.org/r/354226 (https://phabricator.wikimedia.org/T165136)
[13:20:37] 07Puppet, 10DBA, 10Monitoring, 07Documentation, 13Patch-For-Review: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3273391 (10jcrespo) The "separate task" I usually suggest in these cases has a double reason- it makes clear the sco...
[13:32:21] (03PS1) 10Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354228 (https://phabricator.wikimedia.org/T159753)
[13:33:09] !log stopping mariadb and preparing for reimage at db2051
[13:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354228 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[13:35:16] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354228 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[13:35:49] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354228 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[13:40:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 - T159753 T164530 (duration: 01m 03s)
[13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:40] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[13:40:40] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[13:42:17] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354229
[13:44:44] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354229 (owner: 10Marostegui)
[13:45:11] PROBLEM - Host kubernetes2001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:45:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354229 (owner: 10Marostegui)
[13:45:50] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354229 (owner: 10Marostegui)
[13:45:53] PROBLEM - Host kubernetes2004 is DOWN: PING CRITICAL - Packet loss = 100%
[13:45:53] PROBLEM - Host kubernetes2002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:01] RECOVERY - Host kubernetes2001 is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms
[13:46:01] RECOVERY - Host kubernetes2004 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms
[13:46:01] RECOVERY - Host kubernetes2002 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[13:46:18] those are not production, right?
[13:47:00] <_joe_> right [13:47:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 - T159753 T164530 (duration: 00m 39s) [13:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:10] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [13:47:10] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530 [13:48:01] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [13:49:10] mmmm [13:49:26] 1210/11 complaining about redis? [13:49:29] <_joe_> elukey: you taking a look? [13:49:48] yep, seems a brief spike [13:49:50] from https://logstash.wikimedia.org/app/kibana#/dashboard/memcached [13:49:53] 06Operations, 10Traffic, 10netops: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3273427 (10BBlack) I'm probably backtracking into territory that was once known here, but after the long delay I felt I had to go back and re-validate what's going on with the port... [13:50:19] that's fine, these are depooled for a reboot and the delayed nutcracker leads to false positives [13:50:26] I was about to ask [13:50:27] super [13:50:41] <_joe_> moritzm: you didn't merge your change? [13:50:41] I'd appreciate a followup review of https://gerrit.wikimedia.org/r/#/c/353556/ [13:50:42] cool, then [13:51:01] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [13:51:03] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3273428 (10akosiaris) @Papaul the mistake was clearly in the partman recipe. Fixed in https://gerrit.wikimedia.org/r/#/c/354209/ and the boxes are up and running... [13:51:04] no, revised my patch after reading up on the systemd.unit docs [13:51:22] will add it to deployment-prep for some tests later on or tomorrow [13:51:24] <_joe_> moritzm: also, I'm thinking [13:51:38] <_joe_> we might want to have hhvm and not nutcracker [13:52:17] <_joe_> so in terms of puppet code, we might want to do it a bit differently? [13:52:34] <_joe_> I have to think about it [13:52:37] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3273431 (10akosiaris) [13:52:39] hhvm is enabled for service startup at boot, nutcracker is the problem [13:52:54] <_joe_> should we just enable nutcracker as well? [13:53:02] see https://phabricator.wikimedia.org/T163795#3254215 [13:53:12] we should do both, enable nutcracker for startup [13:53:26] (03PS1) 10Alexandros Kosiaris: Assign roles to kubernetes200X hosts [puppet] - 10https://gerrit.wikimedia.org/r/354230 (https://phabricator.wikimedia.org/T164851) [13:53:33] and my service dependency patch [13:53:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Assign roles to kubernetes200X hosts [puppet] - 10https://gerrit.wikimedia.org/r/354230 (https://phabricator.wikimedia.org/T164851) (owner: 10Alexandros Kosiaris) [13:53:56] <_joe_> yeah, I'm just saying we should find a better way to add the dependency [13:54:37] if the dependency is declared, systemd will also sort it correctly during boot startup [13:55:51] I'll test this with with various scenarios in deployment-prep, but I'm fairly sure it's the correct way to declare those. 
but further comments/review appreciated [13:57:11] PROBLEM - DPKG on kubernetes2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:57:12] <_joe_> yeah my point is that we might have cases where we install hhvm and not nutcracker [13:58:11] RECOVERY - DPKG on kubernetes2004 is OK: All packages OK [13:59:51] PROBLEM - puppet last run on kubernetes2004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [13:59:51] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [14:00:02] PROBLEM - puppet last run on kubernetes2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [14:00:11] PROBLEM - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 25 seconds ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [14:00:19] expected ^ [14:03:04] _joe_: hmm, good point, we in fact have such a case (osmium) [14:03:49] (03CR) 10Muehlenhoff: [C: 04-1] "Needs to be revised, we have at least one server running HHVM which doesn't use nutcracker (osmium)" [puppet] - 10https://gerrit.wikimedia.org/r/353556 (https://phabricator.wikimedia.org/T163795) (owner: 10Muehlenhoff) [14:08:19] ACKNOWLEDGEMENT - puppet last run on kubernetes2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Still fixing the CNI package issue [14:08:19] ACKNOWLEDGEMENT - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Still fixing the CNI package issue [14:08:19] ACKNOWLEDGEMENT - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Still fixing the CNI package issue [14:08:19] ACKNOWLEDGEMENT - puppet last run on kubernetes2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Still fixing the CNI package issue [14:12:31] !log perform a final reboot on kubernetes200X [14:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:16] PROBLEM - Host kubernetes2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:26] RECOVERY - Host kubernetes2003 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:16:36] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:16:46] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:17:16] PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
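Editor's note: on the HHVM/nutcracker boot-ordering thread above, the systemd mechanics being discussed are a Wants=/After= dependency: once declared, systemd both pulls nutcracker in and orders it before HHVM at boot. The sketch below only illustrates that mechanism with a local drop-in; the change actually under review (353556) is a puppet patch, and per the follow-up it has to stay conditional because hosts like osmium run HHVM without nutcracker.
```bash
# Sketch: express "start nutcracker before HHVM" with a systemd drop-in,
# without touching the packaged hhvm.service unit (unit names assumed).
sudo systemctl edit hhvm.service
# in the editor, add:
#   [Unit]
#   Wants=nutcracker.service      # pull nutcracker in whenever hhvm starts (incl. at boot)
#   After=nutcracker.service      # and order hhvm's start after nutcracker's
sudo systemctl daemon-reload      # usually done by systemctl edit already; harmless to repeat
systemctl list-dependencies hhvm.service | grep nutcracker   # verify the dependency is in place
```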
[14:17:16] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:17:17] damned docker [14:17:32] "Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed [14:20:36] ACKNOWLEDGEMENT - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris still bringing the service online [14:20:37] ACKNOWLEDGEMENT - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris still bringing the service online [14:20:37] ACKNOWLEDGEMENT - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris still bringing the service online [14:20:37] ACKNOWLEDGEMENT - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris still bringing the service online [14:23:32] 06Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3273472 (10MoritzMuehlenhoff) None of the packages removed for 8.8 were present in our environment. These are fully rolled out: logback irqbalance libdatetime-timezone-perl wget vim groovy [14:32:05] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3273486 (10Papaul) @akosiaris Thanks will resume the install. [14:32:36] !log rebooting mr1-ulsfo for software upgrade - T164970 [14:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:44] T164970: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970 [14:35:36] ACKNOWLEDGEMENT - Host asw-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi upgrading mr1-ulsfo [14:45:56] 06Operations, 10Traffic: Define 3-host infra cluster for traffic pops - https://phabricator.wikimedia.org/T96852#3273494 (10BBlack) The tentative and limited plan for now is to deploy 3x misc/infra hosts (meaning all the hosts other than lvs and cp) at each cache site and not use virtualization. We might revi... [14:46:09] 06Operations, 10ops-ulsfo, 06DC-Ops, 10netops: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3273495 (10ayounsi) 05Open>03Resolved [14:46:53] !log rebooting restbase1008 for update to Linux 4.9 and to pick up OpenJDK security updates [14:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:26] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:27] PROBLEM - puppet last run on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:36] PROBLEM - Check systemd state on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:45] ^downtime exprited, fixing [14:58:46] PROBLEM - nutcracker port on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:46] PROBLEM - Disk space on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:58:46] PROBLEM - Check size of conntrack table on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:47] PROBLEM - HHVM processes on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:47] PROBLEM - configured eth on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:47] PROBLEM - nutcracker process on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:56] PROBLEM - HHVM rendering on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:56] PROBLEM - DPKG on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:56] PROBLEM - dhclient process on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:59:06] PROBLEM - SSH on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:16] PROBLEM - Nginx local proxy to apache on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:16] PROBLEM - salt-minion processes on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:00:07] RECOVERY - salt-minion processes on mw1219 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:00:16] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:00:26] RECOVERY - Check systemd state on mw1219 is OK: OK - running: The system is fully operational [15:00:36] RECOVERY - Disk space on mw1219 is OK: DISK OK [15:00:36] RECOVERY - Check size of conntrack table on mw1219 is OK: OK: nf_conntrack is 0 % full [15:00:36] RECOVERY - Check whether ferm is active by checking the default input chain on mw1219 is OK: OK ferm input default policy is set [15:00:36] RECOVERY - configured eth on mw1219 is OK: OK - interfaces up [15:00:37] RECOVERY - HHVM processes on mw1219 is OK: PROCS OK: 6 processes with command name hhvm [15:00:46] RECOVERY - dhclient process on mw1219 is OK: PROCS OK: 0 processes with command name dhclient [15:00:46] RECOVERY - DPKG on mw1219 is OK: All packages OK [15:00:56] RECOVERY - SSH on mw1219 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [15:01:06] RECOVERY - Nginx local proxy to apache on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 5.838 second response time [15:01:16] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.916 second response time [15:02:46] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 74782 bytes in 0.299 second response time [15:07:01] (03PS1) 10Marostegui: db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) [15:07:15] (03PS2) 10Marostegui: db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) [15:07:34] (03CR) 10Marostegui: [C: 04-2] "Wait for maintenance on db1074 finish" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [15:18:00] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3273518 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque... 
[15:18:26] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3273519 (10Marostegui) Thanks! [15:28:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [15:31:31] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [15:31:43] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [15:32:24] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3273537 (10Papaul) a:05Papaul>03akosiaris [15:34:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1074, depool db1060 - T162611 (duration: 00m 39s) [15:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:16] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [15:34:36] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [15:34:48] !log Deploy alter table s2.revision table - db1060 - https://phabricator.wikimedia.org/T162611 [15:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:04] (03CR) 10Eevans: "> Thanks a lot, looks great! Just a couple of questions/notes for" [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [15:37:36] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:47:06] (03CR) 10Elukey: "> Sorry, I'm still trying to construct the mental model of how" [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [15:48:16] 06Operations, 13Patch-For-Review, 15User-Elukey: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748#3273551 (10elukey) [15:54:19] <_joe_> !log uploaded package cni to jessie-wikimedia [15:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:26] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:55:55] <_joe_> paravoid: ^^ as promised :P [15:59:00] heh [15:59:16] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T1600). Please do the needful. [16:00:44] there seems to be no patches to merge afaics [16:02:16] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:04:16] (03CR) 10Eevans: "> > It is setting it to 25165824 (in the cassandra profile), no?" 
[puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [16:04:26] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:04:36] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [16:07:36] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:07:52] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3273597 (10elukey) Apache's mod_ssl seems to default to not expecting any response after sending a close notify: ``` # SSL Protocol Adjustments: # The safe and defau... [16:11:25] !log upgraded cassandra-tools-wmf on aqs hosts [16:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:25] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:18:25] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:24:33] 06Operations, 10Phabricator, 13Patch-For-Review, 06Release-Engineering-Team (Watching / External): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273662 (10RobH) This was already installed and has puppet/salt accepted, seems the ticket just got neglected. @Paladox: You had... [16:24:47] 06Operations, 10Phabricator, 06Release-Engineering-Team (Watching / External): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273663 (10RobH) a:05RobH>03None [16:25:28] 06Operations, 10Phabricator, 06Release-Engineering-Team (Watching / External): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273668 (10Paladox) @RobH hi, that would be releng (@mmodell) for service implementation. [16:27:08] 06Operations, 10Phabricator, 06Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273670 (10RobH) a:03mmodell @Paladox: Thanks! @mmodell: It looks like we got this system spun up and installed awhile ago. I've assigned this task to you for s... [16:28:47] !log restarting cassandra on restbase1010, restbase1011, restbase1016, restbase1018 to pick up OpenJDK security updates [16:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:08] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3273679 (10RobH) [16:33:31] 06Operations: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618#3273689 (10jcrespo) BBlack - I am not disagreeing with you, in fact we already do throughput limitation at application side by limiting thread concurrency to 64 on our servers, which is more than the number of... [16:44:56] 06Operations, 07Performance, 15User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3273706 (10elukey) @Krinkle, @aaron - any opinion? There are a couple of hosts that might be better to decom since the hw is really hold (like rdb1007),... 
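Editor's note: for the Cassandra restarts logged above (picking up the OpenJDK security updates on the restbase hosts), a generic one-node rolling-restart sketch is below. It is a sketch only: these hosts run multiple Cassandra instances, so the real unit names, pre-checks, and any wrapper scripts in use are not taken from this log.
```bash
# Sketch: restart a single Cassandra node politely, then wait before moving on.
nodetool drain                        # flush memtables and stop accepting writes
sudo systemctl restart cassandra      # unit name assumed; multi-instance hosts use e.g. cassandra-a
until nodetool status 2>/dev/null | grep -q '^UN'; do sleep 5; done   # wait for Up/Normal before the next host
```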
[16:49:54] (03PS1) 10Hoo man: Log "api-readonly" errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354246 (https://phabricator.wikimedia.org/T164191) [16:53:12] (03Draft1) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [16:53:15] (03PS2) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T1700). [17:00:17] no parsoid deployment today [17:02:15] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:02:41] (03PS3) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [17:03:35] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 36, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/0: down - OOB-transit: UnitedLayer OOB connection (UL CID: 0502) [100Mbps Cu]BR [17:09:15] !log upgrading mw2130-mw2139 to Linux 4.9 and HHVM 3.18 [17:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:45] mr1 outage is expected. [17:10:11] !log mr1-ulsfo having oob connection re-routed at ulsfo, will flap a bit from 1700-1730 gmt [17:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:35] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 [17:20:38] 06Operations, 10DBA, 10Pybal, 07Availability: Create a backend check for pybal to monitor the MySQL protocol being up - https://phabricator.wikimedia.org/T165677#3273775 (10jcrespo) [17:23:15] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 55.71 ms [17:27:15] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [17:28:15] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [17:32:20] (03Abandoned) 10Brion VIBBER: Make MediaWiki treat $lang of be_x_oldwiki as be-tarask, just don't change the real DB name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 (https://phabricator.wikimedia.org/T111853) (owner: 10Alex Monk) [17:32:28] (03Restored) 10Brion VIBBER: Make MediaWiki treat $lang of be_x_oldwiki as be-tarask, just don't change the real DB name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 (https://phabricator.wikimedia.org/T111853) (owner: 10Alex Monk) [17:33:57] (03Abandoned) 10Brion VIBBER: Disable mp3 uploads for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: 10TheDJ) [17:39:15] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [17:41:15] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [17:57:25] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:57:45] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:35] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures [18:00:00] jynus: https://gerrit.wikimedia.org/r/#/c/354138/ [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T1800). [18:00:05] Jamesofur: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:52] AaronSchulz, please try an op in a better timezone, I was finishing and about to leave [18:01:28] (03CR) 10Jcrespo: [C: 031] Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [18:01:43] it is 20h here [18:02:00] jynus: well, I wanted you to sign off before pinging anyone else at least [18:02:09] I just did [18:03:13] please keep an eye on graphite after deploying [18:04:19] * Jamesofur is here [18:19:45] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational [18:22:45] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:25:26] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:26:10] Hello [18:26:34] Jamesofur: I can deploy [18:27:18] I've just reached the hackathon hotel, and I were going to go to metalab, but I can deploy this first. [18:29:58] Dereckson: \o/ [18:30:14] shouldn't be long [18:30:28] famous last words [18:30:48] true story [18:32:18] Jamesofur: live on mwdebug1002 [18:36:43] Dereckson: looks good [18:36:47] ok [18:37:37] !log dereckson@tin Synchronized php-1.30.0-wmf.1/extensions/SecurePoll/includes/pages/DumpPage.php: Revert "Dump should return decrypted votes" (T145695) (duration: 00m 48s) [18:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:47] T145695: Dump should return decrypted votes - https://phabricator.wikimedia.org/T145695 [18:38:07] here you are. [18:38:39] and works on live [18:38:40] thanks :) [18:47:35] PROBLEM - configured eth on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:25] RECOVERY - configured eth on ms-be1020 is OK: OK - interfaces up [19:04:36] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [19:06:31] !log T164865: configure RESTBase tables for size-tiered compaction (dev env only) [19:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:39] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [19:07:35] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
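Editor's note: the SecurePoll revert deployed above follows the standard SWAT pattern visible in the log: stage the change, verify it on mwdebug1002 ("live on mwdebug1002"), then sync the single file. A rough sketch of those steps from the deployment host; paths and exact commands are assumed, not taken from this log:
```bash
# Sketch of the SWAT flow used above (staging path assumed).
cd /srv/mediawiki-staging
git -C php-1.30.0-wmf.1 pull              # bring the cherry-picked revert into the wmf branch checkout
ssh mwdebug1002.eqiad.wmnet 'scap pull'   # stage it on the debug host for testing
# ...tester verifies via the X-Wikimedia-Debug header pointed at mwdebug1002...
scap sync-file php-1.30.0-wmf.1/extensions/SecurePoll/includes/pages/DumpPage.php \
    'Revert "Dump should return decrypted votes" (T145695)'
```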
[19:35:12] (03PS2) 10Dzahn: wikistats: grant db permissions on first run (labs) [puppet] - 10https://gerrit.wikimedia.org/r/353944 [19:41:11] (03PS3) 10Dzahn: wikistats: grant db permissions on first run (labs) [puppet] - 10https://gerrit.wikimedia.org/r/353944 [19:52:38] !log T164865: restarting RESTBase-dev to apply range delete-based render retention [19:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:46] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [20:03:52] !log T164865: restarting RESTBase-dev, range delete-based render retention [20:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:00] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [20:11:20] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3274118 (10Cmjohnson) HP is sending me a new battery and wants me to upgrade the f/w. Part/s shipped: 871264-001 Part description: SPS-BATT PACK MC 96W V3 Carrier... [20:18:52] (03CR) 10Dzahn: [C: 032] wikistats: grant db permissions on first run (labs) [puppet] - 10https://gerrit.wikimedia.org/r/353944 (owner: 10Dzahn) [20:19:13] (03CR) 10Dzahn: [C: 032] "labs-only" [puppet] - 10https://gerrit.wikimedia.org/r/353944 (owner: 10Dzahn) [20:39:19] (03CR) 10Dzahn: [C: 032] return HTTP 503 if database connection fails [software/dbtree] - 10https://gerrit.wikimedia.org/r/353388 (https://phabricator.wikimedia.org/T163143) (owner: 10Dzahn) [20:39:59] (03CR) 10Dzahn: [V: 032 C: 032] return HTTP 503 if database connection fails [software/dbtree] - 10https://gerrit.wikimedia.org/r/353388 (https://phabricator.wikimedia.org/T163143) (owner: 10Dzahn) [20:44:24] !log terbium / dbtree - deploying gerrit:353388 (sudo -u mwdeploy git pull origin in /srv/dbtree) (T163143) [20:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:32] T163143: dbtree: don't return 200 on error pages - https://phabricator.wikimedia.org/T163143 [20:47:03] !log wasat - git pull - bring to latest, the last changed had never been deployed here like on terbium, but it's also not a backend for dbtree yet (T163141) [20:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:11] T163141: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141 [21:04:29] hey addshore are you busy? [21:13:44] (03CR) 10Dzahn: [C: 032] gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 (owner: 10Dzahn) [21:15:24] (03PS2) 10Dzahn: gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 [21:19:08] (03PS3) 10Dzahn: gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 [21:20:03] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3274227 (10BBlack) So from the above, apache really has 3 different modes of operation: 1) default - sends close notify, but does not wait for a matching close notify fr... 
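Editor's note: the T163674 comment above is enumerating mod_ssl's shutdown behaviours. The names below are the real per-connection variables from the stock ssl.conf "SSL Protocol Adjustments" section, but how (or whether) they end up toggled in the Wikimedia apache config is not shown in this log, so treat the snippet as illustration only.
```bash
# mod_ssl shutdown modes (per-connection environment variables):
#   (default)              send the TLS close_notify alert, close without waiting for the client's
#   ssl-unclean-shutdown   close the socket without sending close_notify at all
#   ssl-accurate-shutdown  send close_notify and wait for the client's close_notify before closing
# Hypothetical example of toggling one of them per client, mirroring the stock config:
sudo tee /etc/apache2/conf-available/ssl-shutdown-example.conf >/dev/null <<'EOF'
BrowserMatch "MSIE [2-6]" nokeepalive ssl-unclean-shutdown
EOF
sudo apachectl configtest     # sanity-check before enabling anything
```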
[21:29:03] (03PS4) 10Dzahn: gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 [21:45:14] 06Operations, 10ops-eqiad: rack and setup 24 parsoid servers - https://phabricator.wikimedia.org/T165520#3274287 (10RobH) [21:48:46] 06Operations, 10ops-eqiad: rack and setup 24 parsoid servers - https://phabricator.wikimedia.org/T165520#3274294 (10RobH) p:05Triage>03Normal a:03Cmjohnson [21:49:45] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational [21:52:45] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:18:36] 07Puppet, 06Labs, 10MediaWiki-Vagrant, 13Patch-For-Review, 15User-bd808: Make role::labs::mediawiki_vagrant work on Debian Jessie host systems - https://phabricator.wikimedia.org/T154340#3274360 (10bd808) 05Open>03Resolved This has been functional for several months. I think I just lost track of the... [22:29:15] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [22:32:15] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:36:08] (03CR) 10Dzahn: [C: 032] "i removed the "profile::" part and it also compiles as no-op now http://puppet-compiler.wmflabs.org/6485/" [puppet] - 10https://gerrit.wikimedia.org/r/354075 (owner: 10Dzahn) [22:36:16] (03PS5) 10Dzahn: gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 [22:49:45] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational [22:52:45] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T2300). [23:04:35] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [23:07:36] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:09:48] Nothing to SWAT.