[00:02:17] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:13:07] PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=96%)
[00:53:28] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[00:54:37] (PS2) Tim Landscheidt: apt: Remove extra space in sources.list [puppet] - https://gerrit.wikimedia.org/r/263380
[00:55:45] (PS2) Tim Landscheidt: dynamicproxy: Use lua-json package instead of liblua5.1-json [puppet] - https://gerrit.wikimedia.org/r/263230
[00:56:49] (PS2) Tim Landscheidt: Tools: Unpuppetize host_aliases [puppet] - https://gerrit.wikimedia.org/r/241582 (https://phabricator.wikimedia.org/T109485)
[00:57:47] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[01:05:48] (CR) Yuvipanda: [C: 2] "Thanks for the patch!" [puppet] - https://gerrit.wikimedia.org/r/263230 (owner: Tim Landscheidt)
[01:08:27] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[01:14:38] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[01:22:08] PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1059 MB (3% inode=96%)
[01:24:59] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0]
[01:27:08] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[01:45:17] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:46:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:49:27] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:50:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:23:21] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 09s)
[02:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 25 02:30:13 UTC 2016 (duration 6m 52s)
[02:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:38] PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1059 MB (3% inode=96%)
[03:02:32] !log tstarling@tin Synchronized php-1.27.0-wmf.10/includes/parser/ParserOutput.php: (no message) (duration: 00m 27s)
[03:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:03:22] !log tstarling@tin Synchronized php-1.27.0-wmf.10/includes/parser/ParserCache.php: (no message) (duration: 00m 25s)
[03:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:06:58] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:07:07] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:10:26] !log tstarling@tin Synchronized php-1.27.0-wmf.10/includes/parser/ParserCache.php: (no message) (duration: 00m 25s)
[03:43:08] PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1061 MB (3% inode=96%)
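[ed. note: kafka1020's recurring "Disk space ... CRITICAL" above runs through most of this log; the cause turns out (from 07:09 onward) to be an unrotated ~21 GB JVM GC log. A minimal triage sketch, assuming standard GNU coreutils on the host:]

    df -h /                        # matches the icinga check: ~1 GB free on the root filesystem
    du -xsh /var/log/* | sort -h   # largest offenders; here /var/log/kafka holds kafkaServer-gc.log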
[04:16:28] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[04:20:58] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=82%)
[04:54:19] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[05:00:48] PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1062 MB (3% inode=96%)
[05:59:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[06:13:10] (PS1) Andrew Bogott: Don't send puppet nags to the novaadmin user. [puppet] - https://gerrit.wikimedia.org/r/266192 (https://phabricator.wikimedia.org/T124516)
[06:22:37] PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=96%)
[06:26:57] RECOVERY - Disk space on iridium is OK: DISK OK
[06:31:18] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:07] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:18] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:58] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:47] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:27] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: puppet fail
[06:40:19] operations, Discovery, Wikidata, Wikidata-Query-Service: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#1961009 (EBernhardson)
[06:40:29] operations, Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1960955 (EBernhardson)
[06:56:47] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:56:59] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:57:47] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:58:28] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:47] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:09:06] disk space issues on a bunch of kafka machines, it looks like /var/log/kafka/kafkaServer-gc.log needs to be rotated
[07:12:09] <_joe_> ema: hi
[07:12:30] good morning!
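[ed. note: the rotation fix ema raises here lands later in this log as Gerrit change 266203 ("Limit GCLog file size"). A sketch of the HotSpot flags such a change involves, assuming stock JDK 7+ GC-logging options; the counts and sizes below are illustrative, not the values from the actual change:]

    # appended to the broker's JVM arguments (on these hosts via KAFKA_OPTS in /etc/default/kafka, see 10:48):
    -Xloggc:/var/log/kafka/kafkaServer-gc.log
    -XX:+UseGCLogFileRotation    # rotate instead of growing a single file forever
    -XX:NumberOfGCLogFiles=10    # illustrative count
    -XX:GCLogFileSize=100M       # illustrative per-file size cap

[With rotation on, files get a numeric suffix plus ".current" for the active one, which is why the old un-suffixed 21 GB log is left behind after the restart; see the 09:57 directory listing below.]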
[07:12:32] <_joe_> ema: that's strange, somehow
[07:13:54] <_joe_> 2015-11-10T20:26
[07:13:55] <_joe_> lol
[07:14:04] <_joe_> yeah it seems like it could use a rotation :)
[07:14:06] yep
[07:23:33] <_joe_> !log restarting hhvm on mw1143, stuck into HPHP::SynchronizableMulti::waitImpl (__pthread_cond_wait)
[07:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:24:38] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.165 second response time
[07:24:42] <_joe_> !log restarting hhvm on mw1148, stuck in HPHP::Treadmill::startRequest (__lll_lock_wait)
[07:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:25:17] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 69689 bytes in 1.172 second response time
[07:26:28] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.137 second response time
[07:26:28] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 69689 bytes in 0.493 second response time
[07:34:49] <_joe_> !log rebooting alsafi, unresponsive to ssh
[07:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:37:49] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[07:38:07] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient
[07:38:08] RECOVERY - Disk space on alsafi is OK: DISK OK
[07:38:08] RECOVERY - configured eth on alsafi is OK: OK - interfaces up
[07:38:08] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:38:08] RECOVERY - DPKG on alsafi is OK: All packages OK
[07:38:08] RECOVERY - RAID on alsafi is OK: OK: no RAID installed
[07:38:30] still with alsafi?
[07:40:09] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:40:58] <_joe_> paravoid: it's the "ganeti/kvm bug
[07:41:09] <_joe_> that alex has been trying to fix for some time, I guess
[07:41:13] well we had a workaround for that
[07:41:17] aio=native I believe
[07:41:21] I'll check with him today
[07:41:27] <_joe_> thanks :)
[07:42:06] <_joe_> btw, I scheduled downtime for stat1002, since it's running a gazillion perl scripts and has a load average of 1300
[07:42:11] <_joe_> I'll open a ticket
[07:44:06] 1300?
[07:44:20] <_joe_> yes
[07:44:26] how many perl scripts? :)
[07:44:39] PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1056 MB (3% inode=96%)
[07:44:56] <_joe_> 1263.89 atm
[07:45:04] <_joe_> uhm, I don't know for sure
[07:45:05] (good week to you too kafka1020)
[07:45:17] nagios 1109 0.0 0.0 7136 908 ? D Jan24 0:00 /bin/ps axwwo stat uid pid ppid vsz rss p
[07:45:19] <_joe_> paravoid: it's the GC log that never got rotated on all kafka machines
[07:45:20] nagios 1122 0.0 0.0 7136 908 ? D 02:28 0:00 /bin/ps axwwo stat uid pid ppid vsz rss p
[07:45:23] nagios 1134 0.0 0.0 7136 908 ? D Jan24 0:00 /bin/ps axwwo stat uid pid ppid vsz rss p
[07:45:30] smells like something else
[07:45:39] <_joe_> paravoid: uh ps not working?
[07:45:49] <_joe_> yeah sounds bad
[07:45:56] tons of all that
[07:45:59] <_joe_> top works btw
[07:46:13] probably fuse again?
[07:46:22] yup
[07:46:24] <_joe_> probably that
[07:46:25] root@stat1002:~# ls /mnt/hdfs
[07:46:29] <_joe_> yeah
[07:46:30] <_joe_> sigh
[07:47:30] !log stat1002: umount -f /mnt/hdfs
[07:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:47:58] RECOVERY - dhclient process on stat1002 is OK: PROCS OK: 0 processes with command name dhclient
[07:48:37] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:48:37] RECOVERY - Disk space on stat1002 is OK: DISK OK
[07:49:02] <_joe_> I always forget about fuse
[07:50:53] it was mounted again, probably by puppet
[07:51:08] seems to work
[07:52:24] so, kafka?
[07:53:09] <_joe_> kafka has a very large (~20 G) gc log file, because the kafka startup doesn't define a rotation policy for it
[07:53:31] <_joe_> so ema was suggesting we fix that and do a rolling restart of kafka
[07:53:48] akosiaris (with your clinic duty hat on): ops calendar is missing all those telia notices (I noticed because a wave is down and I was wondering if it's scheduled or not -- it is)
[07:53:52] <_joe_> but neither of us is familiar with it, so we need to read a bit about how to operate the cluster
[07:54:22] <_joe_> I'm fixing other things while he was having breakfast :)
[07:55:19] (PS1) Giuseppe Lavagetto: scap: re-add mw2039 [puppet] - https://gerrit.wikimedia.org/r/266197 (https://phabricator.wikimedia.org/T124282)
[07:55:52] (CR) Giuseppe Lavagetto: [C: 2 V: 2] scap: re-add mw2039 [puppet] - https://gerrit.wikimedia.org/r/266197 (https://phabricator.wikimedia.org/T124282) (owner: Giuseppe Lavagetto)
[07:56:13] I made a rolling restart of the kafka brokers for the last java update, it's now documented here: https://wikitech.wikimedia.org/wiki/Service_restarts#Kafka_brokers
[07:56:47] RECOVERY - NTP on alsafi is OK: NTP OK: Offset -0.0001405477524 secs
[07:58:18] ok, I need to bbl
[07:58:38] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 120, down: 0, dormant: 0, excluded: 1, unused: 0
[08:02:47] RECOVERY - NTP on mw2039 is OK: NTP OK: Offset -0.0003932714462 secs
[08:03:48] RECOVERY - mediawiki-installation DSH group on mw2039 is OK: OK
[08:04:28] RECOVERY - salt-minion processes on ruthenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:13:18] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:22:47] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:24:28] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:36:47] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:42:03] (PS2) Giuseppe Lavagetto: role::deployment: add a warning on the inactive server [puppet] - https://gerrit.wikimedia.org/r/265515
[08:47:03] operations, Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#1961165 (MoritzMuehlenhoff) Hmm, the package has been touched since 2014. There's a wishlist bug to upgrade it (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=804633), but that hasn't seen a followup either...
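[ed. note: the stuck /bin/ps processes listed above are in uninterruptible sleep ("D" state), a classic symptom of a dead FUSE mount; ls /mnt/hdfs hanging confirms it. A minimal recovery sketch under that assumption, matching the umount !logged at 07:47:]

    ps axo stat,pid,comm | awk '$1 ~ /^D/'   # D-state processes usually point at a hung mount or disk
    umount -f /mnt/hdfs                      # force-unmount the dead hadoop-fuse filesystem
    # puppet re-mounts it on its next run, as observed at 07:50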
[08:48:08] (PS3) Giuseppe Lavagetto: role::deployment: add a warning on the inactive server [puppet] - https://gerrit.wikimedia.org/r/265515
[08:50:07] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:52:38] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:54:38] PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1055 MB (3% inode=96%)
[09:05:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 606
[09:11:57] (CR) Giuseppe Lavagetto: [C: 2] role::deployment: add a warning on the inactive server [puppet] - https://gerrit.wikimedia.org/r/265515 (owner: Giuseppe Lavagetto)
[09:15:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 495415 Threads: 2 Questions: 3900342 Slow queries: 3168 Opens: 1327 Flush tables: 2 Open tables: 399 Queries per second avg: 7.872 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:18:09] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:19:01] operations, OTRS, user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1961204 (akosiaris) >>! In T74109#1959212, @Base wrote: > Is there/Would there be a way to updating just translations? A one that is maintainable, makes sense in the long run and tha...
[09:22:17] RECOVERY - Disk space on kafka1020 is OK: DISK OK
[09:22:18] operations, OTRS, user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1961206 (akosiaris) >>! In T74109#1959235, @pajz wrote: [snip] > I believe the best strategy is to have someone set up an account on Transifex and fix these mistakes as they are broug...
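[ed. note: the db1008 SLOW_SLAVE alert above is plain replication lag (606 s behind master) that clears on its own by 09:15. A sketch of the manual check behind the icinga numbers, assuming MySQL CLI access on the host:]

    mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'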
[09:25:37] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[09:27:09] !log installed fuse security update on labnodepool1001 (the other fuse installations are on Ubuntu, which doesn't ship the udev rule, but uses mountall instead)
[09:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:30:12] (PS1) Ema: Limit GCLog file size [puppet/kafka] - https://gerrit.wikimedia.org/r/266203
[09:31:58] <_joe_> !log rolling reboot of the eqiad appserver cluster
[09:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:37:12] (CR) Giuseppe Lavagetto: [C: -1] Limit GCLog file size (1 comment) [puppet/kafka] - https://gerrit.wikimedia.org/r/266203 (owner: Ema)
[09:41:03] (PS1) Giuseppe Lavagetto: role::deployment: fix heredoc quoting for inactive motd [puppet] - https://gerrit.wikimedia.org/r/266204
[09:41:29] (CR) Giuseppe Lavagetto: [C: 2 V: 2] role::deployment: fix heredoc quoting for inactive motd [puppet] - https://gerrit.wikimedia.org/r/266204 (owner: Giuseppe Lavagetto)
[09:42:27] PROBLEM - Host mw1210 is DOWN: PING CRITICAL - Packet loss = 100%
[09:43:18] RECOVERY - Host mw1210 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms
[09:45:27] PROBLEM - Host mw1065 is DOWN: PING CRITICAL - Packet loss = 100%
[09:46:29] (PS2) Ema: Limit GCLog file size [puppet/kafka] - https://gerrit.wikimedia.org/r/266203
[09:46:58] RECOVERY - Host mw1065 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms
[09:47:02] (CR) Giuseppe Lavagetto: [C: 1] Limit GCLog file size [puppet/kafka] - https://gerrit.wikimedia.org/r/266203 (owner: Ema)
[09:54:17] PROBLEM - Host mw1038 is DOWN: PING CRITICAL - Packet loss = 100%
[09:54:17] PROBLEM - Host mw1172 is DOWN: PING CRITICAL - Packet loss = 100%
[09:54:58] RECOVERY - Host mw1038 is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms
[09:55:56] (PS1) Giuseppe Lavagetto: deployment: switch to mira [dns] - https://gerrit.wikimedia.org/r/266206
[09:56:12] !log limiting GCLogFileSize and restarting kafka on kafka1012
[09:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:56:33] <_joe_> ema: do that with puppet?
[09:56:48] <_joe_> as in, merge your change and let puppet take care of things?
[09:56:58] <_joe_> or you do need to truncate that file first?
[09:57:28] _joe_: apparently by adding NumberOfGCLogFiles the file does not get rotated properly :(
[09:57:45] <_joe_> wat?
[09:57:56] -rw-r--r-- 1 kafka kafka 21G Jan 25 09:55 kafkaServer-gc.log
[09:57:56] -rw-r--r-- 1 kafka kafka 106K Jan 25 09:57 kafkaServer-gc.log.0.current
[09:58:09] PROBLEM - Host mw1176 is DOWN: PING CRITICAL - Packet loss = 100%
[09:58:12] <_joe_> uhm is the file still in use by the java process?
[09:58:22] <_joe_> I suspect it's not
[09:59:01] it is not
[09:59:16] <_joe_> it's the naming scheme that changes
[09:59:25] <_joe_> java != unix :)
[09:59:28] RECOVERY - Host mw1176 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[10:02:59] (PS1) Giuseppe Lavagetto: deployment: switch to mira as the active deployment server [puppet] - https://gerrit.wikimedia.org/r/266208
[10:03:34] _joe_: anyways, I assume it's fine to merge the change and then truncate the huge files by hand?
[10:03:37] operations: Metrics not reaching Graphite - https://phabricator.wikimedia.org/T124639#1961302 (Peter) NEW a:Ottomata
[10:03:40] <_joe_> yes
[10:04:08] RECOVERY - Disk space on kafka1012 is OK: DISK OK
[10:04:25] the JVM did the weird thing then, correctly starting the new log rotation scheme but not dropping the old log
[10:04:35] <_joe_> why should it "
[10:04:39] (CR) Ema: [C: 2 V: 2] Limit GCLog file size [puppet/kafka] - https://gerrit.wikimedia.org/r/266203 (owner: Ema)
[10:04:39] <_joe_> drop" it?
[10:04:52] <_joe_> it has no knowledge of the previously defined logfile
[10:05:07] <_joe_> anyways, I'm pretty sure there is a way to tell a running JVM to rotate a logfile
[10:05:14] <_joe_> but I don't remember that offhand
[10:05:19] _joe_ my understanding is that the JVM restawikpespes the FGGClog s on restart
[10:05:38] <_joe_> ?
[10:05:47] <_joe_> restawikpespes means?
[10:05:56] _joe_: how to do a proper puppet-merge now? kafka is a git submodule apparently
[10:05:56] sorry mosh lost the connection
[10:05:59] <_joe_> eheh
[10:06:21] _joe_: I was saying that I thought the JVM wiped all the GC logs on restart
[10:06:23] (CR) Giuseppe Lavagetto: [C: 2] deployment: switch to mira [dns] - https://gerrit.wikimedia.org/r/266206 (owner: Giuseppe Lavagetto)
[10:06:30] this is why I would have expected also that file to be gone
[10:06:32] <_joe_> elukey: the ones it knows about ;)
[10:06:48] PROBLEM - Host mw1060 is DOWN: PING CRITICAL - Packet loss = 100%
[10:06:59] makes sense :)
[10:07:57] RECOVERY - Host mw1060 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[10:08:09] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:08:54] (CR) Giuseppe Lavagetto: [C: 2] deployment: switch to mira as the active deployment server [puppet] - https://gerrit.wikimedia.org/r/266208 (owner: Giuseppe Lavagetto)
[10:10:08] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.123 second response time
[10:11:19] <_joe_> !log switching the active deployment host to mira
[10:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:14:00] is salt supposed to be restarted?
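[ed. note: on ema's puppet-merge question at 10:05 — puppet/kafka is a git submodule of operations/puppet, so the module change (266203) has to be followed by a commit in the parent repo bumping the recorded submodule pointer, which is exactly the "Update kafka submodule" change (266209) that appears at 10:26. A generic sketch of such a bump; the commit id is hypothetical:]

    cd modules/kafka
    git fetch origin && git checkout <merged-change-sha>   # hypothetical sha of the merged puppet/kafka change
    cd ../..
    git add modules/kafka && git commit -m 'Update kafka submodule'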
[10:14:08] PROBLEM - salt-minion processes on cp1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:14:08] PROBLEM - salt-minion processes on es1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:14:28] PROBLEM - salt-minion processes on dbproxy1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:14:47] PROBLEM - salt-minion processes on ms-be2005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:14:58] PROBLEM - salt-minion processes on mc1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:07] PROBLEM - salt-minion processes on cp4007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:09] PROBLEM - salt-minion processes on cp3035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:18] PROBLEM - salt-minion processes on restbase2005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:28] PROBLEM - salt-minion processes on mc1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:29] PROBLEM - salt-minion processes on mw2179 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:29] PROBLEM - salt-minion processes on es1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:29] PROBLEM - salt-minion processes on mw2103 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:29] PROBLEM - salt-minion processes on db2069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:30] PROBLEM - salt-minion processes on ganeti2005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:34] <_joe_> jynus: what?
[10:15:37] PROBLEM - salt-minion processes on mw2065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:38] PROBLEM - salt-minion processes on mw1161 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:38] PROBLEM - salt-minion processes on mw1130 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:38] PROBLEM - salt-minion processes on elastic1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:40] <_joe_> oh wtf?
[10:15:46] ^
[10:15:56] <_joe_> jynus: why should it?
[10:15:57] PROBLEM - salt-minion processes on mx2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:57] PROBLEM - salt-minion processes on rdb2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:58] PROBLEM - salt-minion processes on mw2197 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:58] PROBLEM - salt-minion processes on wtp1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:15:58] PROBLEM - salt-minion processes on mw2116 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:05] <_joe_> I guess there is a temporary problem there
[10:16:07] PROBLEM - salt-minion processes on dbstore1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:08] PROBLEM - salt-minion processes on db1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:08] PROBLEM - salt-minion processes on wtp1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:08] PROBLEM - salt-minion processes on rdb1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:08] PROBLEM - salt-minion processes on mc2012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:08] PROBLEM - salt-minion processes on californium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:08] PROBLEM - salt-minion processes on mw1147 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:17] PROBLEM - salt-minion processes on cp2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:17] PROBLEM - salt-minion processes on ms-be2012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:17] PROBLEM - salt-minion processes on cp2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:19] PROBLEM - salt-minion processes on mw2194 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:25] the deployment server is a salt grain
[10:16:27] PROBLEM - salt-minion processes on wtp2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:27] PROBLEM - salt-minion processes on mw2183 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:27] PROBLEM - salt-minion processes on mw2041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:27] PROBLEM - salt-minion processes on mw1171 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:27] PROBLEM - salt-minion processes on mw2170 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:28] PROBLEM - salt-minion processes on cp1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:28] PROBLEM - salt-minion processes on ms-be1012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:28] PROBLEM - salt-minion processes on cp2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:28] PROBLEM - salt-minion processes on berkelium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:29] PROBLEM - salt-minion processes on eventlog1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:37] PROBLEM - salt-minion processes on mw1188 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:37] PROBLEM - salt-minion processes on db2046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:37] PROBLEM - salt-minion processes on mw1124 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:38] PROBLEM - salt-minion processes on db1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:38] PROBLEM - salt-minion processes on wtp1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:39] PROBLEM - salt-minion processes on titanium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:16:42] <_joe_> thanks paravoid
[10:16:58] <_joe_> so I guess salt will eventually sync via puppet
[10:17:01] <_joe_> but let me see
[10:17:07] it probably does, and puppet is restarting salt
[10:17:13] I am checking it
[10:17:19] Jan 25 10:10:30 mw1124 puppet-agent[4164]: (/Stage[main]/Trebuchet/Salt::Grain[trebuchet_master]/Exec[ensure_trebuchet_master_mira.codfw.wmnet]/returns) executed successfully
[10:17:21] yep, it does
[10:17:23] Jan 25 10:10:31 mw1124 kernel: [237427.253419] init: salt-minion main process (32513) terminated with status 1
[10:18:05] but is it related to deployment, an unintended effect, or nothing to do with it?
[10:18:17] <_joe_> it's an "unintended effect"
[10:18:24] ok, then everything is good
[10:18:38] not really :P
[10:18:38] <_joe_> as well, puppet is restarting the minion, shouldn't cause the icinga alert to go off
[10:18:44] salt is *not* running
[10:18:58] I'm re-running puppet now to see if the next run will start it (it probably will)
[10:19:05] well, it does if you run puppet
[10:19:17] but even if it is transient in nature, it needs 2 puppet cycles which is kind of a problem
[10:19:20] <_joe_> paravoid: it's running on mc2012 where it was alarming
[10:19:33] <_joe_> paravoid: yes, honestly I didn't expect salt to crash....
[10:20:20] <_joe_> and yes, it requires another puppet run, wtf
[10:21:40] <_joe_> the "funny" fact is that it fails on systems that should /not/ be affected by deployments at all...
[10:21:53] we could use salt to force a puppet run and... o wait
[10:21:59] ema, _joe_: event logging has problems consuming messages from kafka
[10:22:01] <_joe_> jynus: heh
[10:22:08] <_joe_> elukey: since when?
[10:22:16] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [30.0]
[10:22:18] <_joe_> elukey: did you guys restart kafka?
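[ed. note: the grain change that sets off this die-off (filed later in this log as T124646) is applied per host by the puppet Exec visible in the 10:17 syslog excerpt above. On a minion it boils down to roughly the following, a sketch using stock salt-call grain functions:]

    salt-call grains.get trebuchet_master                      # current deployment server
    salt-call grains.setval trebuchet_master mira.codfw.wmnet  # what the puppet Exec effectively does
    # observed bug: the minion exits with status 1 right after the change and needs a second puppet run to come back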
[10:22:21] elukey, eventlogging is stopped
[10:22:27] <_joe_> oh ok
[10:22:28] AFAIK
[10:22:56] jynus, the mysql consumers are stopped but the processing of events into kafka should work
[10:23:00] jynus: I believe that the consumers are stopped
[10:23:27] mforns: how did you check the connection problem to the brokers
[10:23:28] ?
[10:23:58] EL dashboard: https://grafana.wikimedia.org/dashboard/db/eventlogging
[10:24:19] elukey, in the logs: tail -fn 100 /srv/log/upstart/eventlogging_processor-client-side-00.log
[10:24:31] ah okok on EL hosts, super :)
[10:26:24] (PS1) Ema: Update kafka submodule [puppet] - https://gerrit.wikimedia.org/r/266209
[10:30:21] Hey ops-team, how are we on the kafka restart ?
[10:31:14] joal: ema is adding the last change for the GClog size with the last gerrit patch
[10:31:25] we should be able to be back with 3 nodes soon
[10:31:36] for the moment only one up ?
[10:32:24] not really sure how to read https://grafana.wikimedia.org/dashboard/db/kafka
[10:32:51] operations, ops-eqiad: mw1172 is unresponsive, mgmt interface unreachable - https://phabricator.wikimedia.org/T124642#1961368 (Joe) NEW
[10:34:49] should we perform a master election?
[10:36:09] ema, joal: it might be due to the fact that the brokers restarted are catching up from the down
[10:36:13] (PS1) Giuseppe Lavagetto: scap: exclude mw1172 from scap until it's properly set up [puppet] - https://gerrit.wikimedia.org/r/266210 (https://phabricator.wikimedia.org/T124642)
[10:36:38] elukey: have they all been restarted now ?
[10:36:51] (CR) Giuseppe Lavagetto: [C: 2] scap: exclude mw1172 from scap until it's properly set up [puppet] - https://gerrit.wikimedia.org/r/266210 (https://phabricator.wikimedia.org/T124642) (owner: Giuseppe Lavagetto)
[10:37:04] (CR) Giuseppe Lavagetto: [V: 2] scap: exclude mw1172 from scap until it's properly set up [puppet] - https://gerrit.wikimedia.org/r/266210 (https://phabricator.wikimedia.org/T124642) (owner: Giuseppe Lavagetto)
[10:37:05] I think only two
[10:37:30] joal: I've fixed the disk space issues, restarted kafka1020 first and kafka 1012 later
[10:37:59] Thanks ema. Problem was only for those 2 I guess
[10:38:05] <_joe_> bbiab
[10:38:31] is there anything else that needs to be done after restarting the service?
[10:38:37] ema: I am trying to run kafka topic --describe without much success on 1020
[10:38:47] I believe we should just use https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka/Administration
[10:39:48] Running kafka topic --describe on kafka1013, everything seems fine
[10:40:48] yeah, kafka topic --describe does not work on kafka1012 either
[10:41:01] ema: now it works on 1020
[10:42:16] oh, sudo kafka topic --describe does not work, you need a login shell
[10:42:16] ema, elukey : this chart seems to tell us everything is fine: https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen
[10:42:36] We have had under-replicated partitions for 3 minutes, and now we are good
[10:42:52] (PS1) Jcrespo: Depool pc1001 for maintenance (clone to pc1004) [mediawiki-config] - https://gerrit.wikimedia.org/r/266211 (https://phabricator.wikimedia.org/T121888)
[10:44:19] joal, mforns: let's check EL logs to be sure
[10:44:27] ok
[10:44:35] mforns: I let you do that ;)
[10:45:27] ema: let's wait for mforns confirmation that things are back to normal, then let's reelect leaders
[10:46:15] wait, logs are not being written, just a sec
[10:47:32] ema: I guess you have applied rotated logs to every kafka node ?
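[ed. note: as noted at 10:42, the cluster's `kafka` wrapper wants a login shell rather than a bare sudo, presumably because it reads its environment from the profile. A sketch of the admin commands used during this incident; see also the wikitech Kafka administration page linked above:]

    sudo -i                            # login shell; plain `sudo kafka ...` fails as observed
    kafka topic --describe             # per-topic partitions, leaders and ISR
    kafka preferred-replica-election   # rebalance partition leaders after rolling restarts (run at 13:16)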
[10:48:17] joal: nope, I've added KAFKA_OPTS to /etc/default/kafka on 1020 and 1012 only
[10:48:32] hm, ok ema
[10:48:35] then I restarted kafka on 1020 and (after a while) on 1012
[10:48:43] right
[10:50:01] I've !logged only the restart of kafka on 1012 though, sorry about that
[10:50:49] (CR) Jcrespo: [C: 2] Depool pc1001 for maintenance (clone to pc1004) [mediawiki-config] - https://gerrit.wikimedia.org/r/266211 (https://phabricator.wikimedia.org/T121888) (owner: Jcrespo)
[10:50:50] ema: I've created a ticket for ottomata and elukey about the issue, can you document the actions taken please ? https://phabricator.wikimedia.org/T124644
[10:51:04] sure thing
[10:51:22] Thanks :)
[10:51:45] mira has giant banner saying "do not use this server". Should I use it?
[10:51:46] ema: Thanks a lot for having quickly reacted :)
[10:52:09] oh, tin says the same :-)
[10:52:31] mmm, race condition, it seems
[10:54:27] <_joe_> jynus: the damn motd thing
[10:54:30] <_joe_> use mira
[10:54:32] <_joe_> :(
[10:54:38] [Mon Jan 25 10:54:01 2016] [hphp] [24034:7f252f6ead00:0:000001] [] Lost parent, LightProcess exiting
[10:54:48] (I am trying)
[10:55:10] <_joe_> uhm supposedly scap was tested from there
[10:55:16] he
[10:55:17] <_joe_> that is just hhvm exiting
[10:55:30] <_joe_> what are you trying to sync?
[10:55:41] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool pc1001 for maintenance (clone to pc1004) (duration: 01m 41s)
[10:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:55:45] that ^
[10:56:03] probably should report that to releng?
[10:56:14] <_joe_> what happened exactly?
[10:56:17] it worked, though, no errors
[10:56:27] <_joe_> ok
[10:56:29] maybe it just restarted
[10:56:44] <_joe_> nope, it's a cli script invoking "php" probably
[10:57:12] yes, maybe the (unrelated) server restarted and it logged to the same place
[10:57:45] <_joe_> no, if you run "php -r 'echo "hello\n";'
[10:57:51] <_joe_> you will see that message
[10:58:03] <_joe_> btw if you log into mira now the motd is gone
[10:58:08] yep
[10:58:18] thanks, joe
[10:58:30] <_joe_> as clearly, the motd gets compiled on first login, but not shown at that login (WTF)
[10:58:44] I actually did this to test the whole thing
[10:58:54] <_joe_> btw we can fix the lightprocess warning
[10:59:00] (although the deploy was needed anyway)
[10:59:07] <_joe_> :)
[10:59:38] I know I am a whining user, but there are worse
[10:59:53] so I prefer me bing the first
[10:59:59] *being
[11:01:51] joal: added a comment to T124644 documenting my actions
[11:04:09] Thanks again :)
[11:04:14] ema --^
[11:06:12] ema: system is stable now, we'll reelect leaders and see if there are other changes to be done with ottomata when he'll arrive
[11:09:13] operations, ops-codfw: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#1961442 (jcrespo) NEW
[11:10:02] joal: alright, then if everything is OK we should also merge https://gerrit.wikimedia.org/r/#/c/266209/ and re-enable puppet on kafka1012 and kafka1020
[11:12:45] operations, ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1961451 (jcrespo) @Cmjohnson, do you need something for doing this? We have now 4 failed disks at the other datacenter.
[11:13:57] <_joe_> jynus: that's papaul I guess
[11:14:07] <_joe_> (codfw)
[11:14:09] ema: works for me
[11:14:10] no
[11:14:16] chris
[11:14:22] <_joe_> oh ok
[11:14:28] believe me or read the ticket :-)
[11:14:46] <_joe_> yes, sorry
[11:15:00] the clue is " at the other datacenter"
[11:15:02] :-)
[11:15:26] ema: when you have merged, I guess you need to restart other nodes, right ?
[11:15:41] joal: right
[11:15:46] ok
[11:15:53] operations, Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#1961461 (Joe) NEW
[11:15:59] and truncate the gigantic logfiles :)
[11:16:02] <_joe_> apergos: ^^
[11:16:08] ema : I'll monitor while you go for that, and then we can reelect
[11:16:15] right ema :)
[11:16:19] what is the state of fluorine, btw?
[11:16:48] <_joe_> jynus: good
[11:16:50] joal: the idea is, after merging and restarting the kafka service we will still have the old logfiles on disk, but kafka should not be using them anymore
[11:16:57] no issue I suppose after the revert
[11:16:58] so it should be safe to truncate them at that point
[11:17:12] I suppose all 11-dependent
[11:17:16] thanks
[11:17:18] PROBLEM - Host mw1185 is DOWN: PING CRITICAL - Packet loss = 100%
[11:17:37] ema: I can't recall having used them ever, so I guess it's fine to truncate :)
[11:17:39] operations, Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#1961469 (ArielGlenn) a:ArielGlenn
[11:18:39] <_joe_> apergos: "thanks..."
[11:18:47] RECOVERY - Host mw1185 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[11:21:48] PROBLEM - Host mw1061 is DOWN: PING CRITICAL - Packet loss = 100%
[11:22:18] RECOVERY - Host mw1061 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms
[11:22:27] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.120 second response time
[11:23:07] PROBLEM - HHVM rendering on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:24:37] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.660 second response time
[11:25:07] RECOVERY - HHVM rendering on mw1061 is OK: HTTP OK: HTTP/1.1 200 OK - 69689 bytes in 0.166 second response time
[11:25:58] PROBLEM - Host mw1214 is DOWN: PING CRITICAL - Packet loss = 100%
[11:26:06] !log stopping mysql at pc1001 and cloning to pc1004
[11:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:26:11] operations: Integrate jessie 8.3 point update - https://phabricator.wikimedia.org/T124647#1961478 (MoritzMuehlenhoff) NEW
[11:26:58] RECOVERY - Host mw1214 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms
[11:29:38] PROBLEM - Host mw1046 is DOWN: PING CRITICAL - Packet loss = 100%
[11:30:47] RECOVERY - Host mw1046 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[11:36:54] (CR) Elukey: [C: 1] Update kafka submodule [puppet] - https://gerrit.wikimedia.org/r/266209 (owner: Ema)
[11:37:38] (PS2) Ema: Update kafka submodule [puppet] - https://gerrit.wikimedia.org/r/266209
[11:41:10] moving 1.6 TB, that will take a while
[11:42:17] PROBLEM - Host mw1178 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:40] ETA 3 hours
[11:42:59] PROBLEM - HHVM rendering on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:43:21] seems compression doesn't do much - things are probably already compressed
[11:44:59] RECOVERY - HHVM rendering on mw1069 is OK: HTTP OK: HTTP/1.1 200 OK - 69689 bytes in 0.195 second response time
[11:46:35] (CR) Giuseppe Lavagetto: [C: 1] Update kafka submodule [puppet] - https://gerrit.wikimedia.org/r/266209 (owner: Ema)
[11:47:40] (CR) Ema: [V: 2] Update kafka submodule [puppet] - https://gerrit.wikimedia.org/r/266209 (owner: Ema)
[11:48:10] (CR) Ema: [C: 2] Update kafka submodule [puppet] - https://gerrit.wikimedia.org/r/266209 (owner: Ema)
[11:50:08] operations, ops-eqiad, Patch-For-Review: mw1172 is unresponsive, mgmt interface unreachable - https://phabricator.wikimedia.org/T124642#1961516 (akosiaris) p:Triage>Normal
[11:51:51] (Draft1) Addshore: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/264734
[11:52:19] (CR) Alexandros Kosiaris: Don't send puppet nags to the novaadmin user. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/266192 (https://phabricator.wikimedia.org/T124516) (owner: Andrew Bogott)
[11:52:36] operations, Labs, Labs-Infrastructure, Patch-For-Review: mail from testlabs to ops list - https://phabricator.wikimedia.org/T124516#1961519 (akosiaris) p:Triage>Normal
[11:52:43] operations, Labs, Labs-Infrastructure, Patch-For-Review: mail from testlabs to ops list - https://phabricator.wikimedia.org/T124516#1958596 (akosiaris) p:Normal>High
[11:53:27] PROBLEM - Host mw1217 is DOWN: PING CRITICAL - Packet loss = 100%
[11:56:51] (PS1) Muehlenhoff: Make neodymium the debdeploy master [puppet] - https://gerrit.wikimedia.org/r/266214
[11:56:53] (PS1) Muehlenhoff: Move debdeploy::master off palladium [puppet] - https://gerrit.wikimedia.org/r/266215
[11:59:15] (CR) Alex Monk: Don't send puppet nags to the novaadmin user. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/266192 (https://phabricator.wikimedia.org/T124516) (owner: Andrew Bogott)
[11:59:31] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1961527 (matmarex) Has this happened? I noticed I've gotten logged out everywhere sometime over the weekend, not sure if that was just a coincidence.
[12:01:48] PROBLEM - Host mw1054 is DOWN: PING CRITICAL - Packet loss = 100%
[12:01:58] joal, elukey: puppet merge done, kafka1018 has applied the changes successfully. Let me know if/when it is OK to restart kafka on kafka1018
[12:02:28] RECOVERY - Host mw1054 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[12:03:23] ema, elukey: ok for me to restart 1018
[12:03:59] ema: Please let me know when done, need to restart eventlogging to cope with changes
[12:04:34] !log restarting kafka on kafka1018
[12:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:05:41] joal: done
[12:05:47] thanks ema
[12:06:47] PROBLEM - Apache HTTP on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:07:05] I've also truncated /var/log/kafka/kafkaServer-gc.log to reclaim disk space on kafka1018
[12:07:39] now I'm going to re-enable puppet on kafka1012 and kafka1020
[12:08:47] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.026 second response time
[12:11:40] joal: done. Is everything fine with eventlogging?
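[ed. note: truncating the old GC log in place (as done on kafka1018 at 12:07) frees its blocks immediately even if a stale file handle still exists, whereas rm would leave the space pinned as an unlinked open file until every handle closes. A minimal sketch:]

    lsof /var/log/kafka/kafkaServer-gc.log           # confirm the restarted broker no longer writes here
    truncate -s 0 /var/log/kafka/kafkaServer-gc.log  # reclaim the ~21 GB without touching the new rotated logs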
[12:12:51] We have restarted it, so should be ok
[12:13:09] ema: --^
[12:13:48] PROBLEM - Host mw1039 is DOWN: PING CRITICAL - Packet loss = 100%
[12:14:48] RECOVERY - Host mw1039 is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms
[12:15:17] PROBLEM - HHVM rendering on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:17:17] RECOVERY - HHVM rendering on mw1055 is OK: HTTP OK: HTTP/1.1 200 OK - 69604 bytes in 0.222 second response time
[12:18:22] ema: everything good on our side
[12:19:01] perfect
[12:20:45] !log compressed and truncated iridium's phab daemons.log - it was taking 20% of disk space
[12:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:26:38] PROBLEM - Apache HTTP on mw1102 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.146 second response time
[12:26:49] (PS1) Giuseppe Lavagetto: role::deployment::salt_masters: correct a hiera lookup [puppet] - https://gerrit.wikimedia.org/r/266216
[12:28:45] (PS2) Muehlenhoff: Make neodymium the debdeploy master [puppet] - https://gerrit.wikimedia.org/r/266214
[12:28:48] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.607 second response time
[12:29:09] (CR) Muehlenhoff: [C: 2 V: 2] Make neodymium the debdeploy master [puppet] - https://gerrit.wikimedia.org/r/266214 (owner: Muehlenhoff)
[12:30:53] s6 hiccup
[12:31:04] I think it's already solved
[12:33:47] PROBLEM - Host mw1237 is DOWN: PING CRITICAL - Packet loss = 100%
[12:34:28] RECOVERY - Host mw1237 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[12:37:25] operations, Continuous-Integration-Infrastructure, Patch-For-Review, WorkType-NewFunctionality: Phase out operations-puppet-pep8 Jenkins job and tools/puppet_pep8.py - https://phabricator.wikimedia.org/T114887#1961592 (scfc) There is a non-voting job `operations-puppet-tox-pep8-jessie` which is cons...
[12:37:57] PROBLEM - Host mw1064 is DOWN: PING CRITICAL - Packet loss = 100%
[12:38:23] !log restarting kafka on kafka1014
[12:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:38:37] PROBLEM - HHVM rendering on mw1066 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time
[12:38:48] RECOVERY - Host mw1064 is UP: PING OK - Packet loss = 0%, RTA = 2.29 ms
[12:40:39] RECOVERY - HHVM rendering on mw1066 is OK: HTTP OK: HTTP/1.1 200 OK - 69581 bytes in 1.636 second response time
[12:41:09] (PS1) Giuseppe Lavagetto: neodymium: add role::deployment::salt_masters [puppet] - https://gerrit.wikimedia.org/r/266218
[12:45:27] PROBLEM - Host mw1240 is DOWN: PING CRITICAL - Packet loss = 100%
[12:46:38] RECOVERY - Host mw1240 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[12:49:49] (PS2) ArielGlenn: move git-deploy to neodymium (primary salt master) [puppet] - https://gerrit.wikimedia.org/r/265248
[12:50:57] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:52:57] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 69602 bytes in 0.107 second response time
[12:54:50] (PS3) ArielGlenn: move git-deploy to neodymium (primary salt master) [puppet] - https://gerrit.wikimedia.org/r/265248
[12:55:51] (CR) ArielGlenn: [C: 2] move git-deploy to neodymium (primary salt master) [puppet] - https://gerrit.wikimedia.org/r/265248 (owner: ArielGlenn)
[12:57:52] !log restarting kafka on kafka1013
[12:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:58:08] PROBLEM - Host mw1043 is DOWN: PING CRITICAL - Packet loss = 100%
[12:58:08] PROBLEM - Host mw1096 is DOWN: PING CRITICAL - Packet loss = 100%
[12:58:48] RECOVERY - Host mw1043 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms
[12:58:57] RECOVERY - Host mw1096 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[12:59:08] PROBLEM - HHVM rendering on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:01:11] RECOVERY - HHVM rendering on mw1043 is OK: HTTP OK: HTTP/1.1 200 OK - 69603 bytes in 0.125 second response time
[13:07:02] !log restarting kafka on kafka1022
[13:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:07:26] (PS1) Muehlenhoff: Remove debdeploy::master from palladium [puppet] - https://gerrit.wikimedia.org/r/266219
[13:10:47] PROBLEM - HHVM rendering on mw1073 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time
[13:12:58] RECOVERY - HHVM rendering on mw1073 is OK: HTTP OK: HTTP/1.1 200 OK - 69604 bytes in 1.787 second response time
[13:14:48] PROBLEM - Apache HTTP on mw1238 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.082 second response time
[13:16:12] !log ran kafka preferred-replica-election on kafka1022 to balance the leaders
[13:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:16:58] RECOVERY - Apache HTTP on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time
[13:17:07] PROBLEM - Host mw1257 is DOWN: PING CRITICAL - Packet loss = 100%
[13:17:09] PROBLEM - HHVM rendering on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:19:08] RECOVERY - HHVM rendering on mw1049 is OK: HTTP OK: HTTP/1.1 200 OK - 69604 bytes in 0.537 second response time
[13:19:24] does a kafka restart trigger errors in HHVM machines? o_O
[13:22:04] <_joe_> no
[13:22:11] <_joe_> it's a rolling reboot still going on
[13:22:35] <_joe_> I'm going to stop it now since I'm going to lunch
[13:31:34] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1961657 (Reedy) >>! In T124440#1961527, @matmarex wrote: > Has this happened? I noticed I've gotten logged out everywhere sometime over the weekend, not sure if tha...
[13:32:58] (CR) Hashar: [C: 1] Correct HTML code for WMF image [mediawiki-config] - https://gerrit.wikimedia.org/r/264461 (owner: Suriyaa Kudo)
[13:33:06] (PS3) ArielGlenn: rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571)
[13:40:27] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:43:09] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail
[13:50:13] Krenair: can you eval.php something for me on enwiki? `Article::purgePatrolFooterCache( 45686529 );` (debugging T123747)
[13:58:38] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[14:06:16] operations, Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#1961711 (ArielGlenn) Tentatively this looks like an issue with the singleton cache of master aes keys at the minion end, a part of the code in transport that needs to b...
[14:06:18] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[14:07:37] PROBLEM - Host mw1082 is DOWN: PING CRITICAL - Packet loss = 100%
[14:08:57] RECOVERY - Host mw1082 is UP: PING OK - Packet loss = 0%, RTA = 1.86 ms
[14:09:08] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[14:12:07] PROBLEM - Host mw1086 is DOWN: PING CRITICAL - Packet loss = 100%
[14:12:28] PROBLEM - Host mw1097 is DOWN: PING CRITICAL - Packet loss = 100%
[14:12:58] RECOVERY - Host mw1086 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[14:12:58] RECOVERY - Host mw1097 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[14:13:18] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:15:18] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.030 second response time
[14:18:34] MarkTraceur: can you eval.php something for me on enwiki? `Article::purgePatrolFooterCache( 45686529 );` (debugging T123747)
[14:19:05] Uhhmmm
[14:19:20] Never done that before but I'll give it a shot
[14:22:05] MatmaRex: Is there a basic how-to for eval.php? --help seemed to have no effect.
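[ed. note: in answer to the eval.php question — on the WMF cluster of this era the usual route is the mwscript wrapper on a maintenance host, which drops into an interactive PHP evaluation loop; a sketch, with the call taken from MatmaRex's request above:]

    mwscript eval.php --wiki=enwiki
    > Article::purgePatrolFooterCache( 45686529 );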
[14:22:35] Aha
[14:23:01] ugh, no idea how it works in WMF environment, to be honest
[14:23:19] MatmaRex: OK, done
[14:26:06] MatmaRex, sorry, was afk
[14:27:39] Krenair: I got it, no worries
[14:28:57] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[14:31:27] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 6 failures
[14:31:58] PROBLEM - Host mw1090 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:08] RECOVERY - Host mw1090 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[14:33:17] PROBLEM - HHVM rendering on mw1053 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time
[14:35:13] (CR) Ottomata: "Thanks!" [puppet/kafka] - https://gerrit.wikimedia.org/r/266203 (owner: Ema)
[14:35:28] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 69603 bytes in 7.544 second response time
[14:38:11] operations, Wikimedia-Language-setup, Wikimedia-Site-Requests: Rename cbk-zamwiki to cbkwiki - https://phabricator.wikimedia.org/T124657#1961729 (Bugreporter) NEW
[14:39:37] PROBLEM - Host mw1108 is DOWN: PING CRITICAL - Packet loss = 100%
[14:39:57] PROBLEM - Host mw1056 is DOWN: PING CRITICAL - Packet loss = 100%
[14:41:07] RECOVERY - Host mw1108 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[14:41:08] RECOVERY - Host mw1056 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms
[14:42:48] (PS1) Jcrespo: Replace pc1001 with pc1004, new pooled parsercache [mediawiki-config] - https://gerrit.wikimedia.org/r/266225 (https://phabricator.wikimedia.org/T121888)
[14:43:28] operations, Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#1961739 (ArielGlenn) To keep minions from dying we should do this: in transport/__init__.py, in crypted_transfer_decode_dictentry() instead of aes = key.pri...
[14:43:53] (PS2) Jcrespo: Replace pc1001 with pc1004, new pooled parsercache [mediawiki-config] - https://gerrit.wikimedia.org/r/266225 (https://phabricator.wikimedia.org/T121888)
[14:44:37] PROBLEM - Host mw1091 is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:15] I am going to pool a new parser cache (new hardware, new mariadb version, etc)
[14:45:17] RECOVERY - Host mw1091 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[14:45:55] nice
[14:46:07] PROBLEM - puppet last run on mw1056 is CRITICAL: CRITICAL: puppet fail
[14:46:10] it is only 1 out of 3 servers, and it should be "easy", but as there are many new things, we should be aware of potential errors
[14:46:36] gotcha
[14:46:44] I will be monitoring logs throughout the day, but please tell of any mediawiki errors, etc.
[14:47:18] also things could be slightly slower for some hours
[14:47:33] (PS1) Muehlenhoff: Tighten dependency [debs/linux-meta] - https://gerrit.wikimedia.org/r/266227
[14:47:37] PROBLEM - puppet last run on mw1091 is CRITICAL: CRITICAL: puppet fail
[14:47:58] PROBLEM - Host mw1180 is DOWN: PING CRITICAL - Packet loss = 100%
[14:48:08] RECOVERY - puppet last run on mw1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:48:21] (PS1) Ottomata: Revert "Temporarily disable eventlogging mysql consumers and burrow monitoring for them" [puppet] - https://gerrit.wikimedia.org/r/266228 (https://phabricator.wikimedia.org/T120187)
[14:48:33] (PS2) Ottomata: Revert "Temporarily disable eventlogging mysql consumers and burrow monitoring for them" [puppet] - https://gerrit.wikimedia.org/r/266228 (https://phabricator.wikimedia.org/T120187)
[14:49:17] RECOVERY - Host mw1180 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms
[14:49:47] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[14:51:20] (CR) Muehlenhoff: [C: 2 V: 2] Tighten dependency [debs/linux-meta] - https://gerrit.wikimedia.org/r/266227 (owner: Muehlenhoff)
[14:51:51] (CR) Jcrespo: [C: 2] Replace pc1001 with pc1004, new pooled parsercache [mediawiki-config] - https://gerrit.wikimedia.org/r/266225 (https://phabricator.wikimedia.org/T121888) (owner: Jcrespo)
[14:52:17] PROBLEM - Host mw1094 is DOWN: PING CRITICAL - Packet loss = 100%
[14:53:17] (CR) Ottomata: [C: 2] Revert "Temporarily disable eventlogging mysql consumers and burrow monitoring for them" [puppet] - https://gerrit.wikimedia.org/r/266228 (https://phabricator.wikimedia.org/T120187) (owner: Ottomata)
[14:53:18] RECOVERY - Host mw1094 is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms
[14:56:37] PROBLEM - Host mw1075 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:38] !log jynus@mira Synchronized wmf-config/db-eqiad.php: deploy new parsercache hardware (pc1004) substituting pc1001 (duration: 03m 25s)
[14:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:57:07] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:57:17] RECOVERY - Host mw1075 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
[14:59:17] operations, vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1961764 (akosiaris) >>! In T124261#1957401, @Dzahn wrote: > @akosiaris I just wanted to avoid putting multiple "misc" sites/apps on the same server with the official releases. It would mean if one of them h...
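[ed. note: the "jynus@mira Synchronized wmf-config/db-eqiad.php" SAL entries in this log are the output of scap's sync-file; a sketch of the invocation, assuming the sync-file tooling of the time, run from the active deployment host (mira after the 10:11 switch):]

    sync-file wmf-config/db-eqiad.php 'deploy new parsercache hardware (pc1004) substituting pc1001'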
[15:00:47] I have to common-sync 3 servers, but I do not see immediate problems (errors on log) [15:01:05] I do not rule out delayed problems, however [15:01:25] so I will not pool more than 1 server per day [15:01:27] PROBLEM - HHVM rendering on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:55] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 323 bytes in 0.040 second response time [15:02:17] PROBLEM - HHVM rendering on mw1184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:18] PROBLEM - Auth DNS for labs pdns on labs-ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:03:28] RECOVERY - HHVM rendering on mw1098 is OK: HTTP OK: HTTP/1.1 200 OK - 69628 bytes in 0.171 second response time [15:03:50] andrewbogott: about?^ [15:03:53] not sure on the dns error [15:04:14] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 772586 bytes in 3.037 second response time [15:04:23] I see many webservice start on tools-checker-02 [15:04:27] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 69621 bytes in 0.154 second response time [15:04:28] RECOVERY - Auth DNS for labs pdns on labs-ns2.wikimedia.org is OK: DNS OK: 0.018 seconds response time. nagiostest.eqiad.wmflabs returns [15:04:38] and now they're gone [15:04:39] (03PS1) 10Muehlenhoff: Fix syntax [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/266229 [15:04:57] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix syntax [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/266229 (owner: 10Muehlenhoff) [15:11:22] (03PS1) 10Ema: ulsfo: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266230 (https://phabricator.wikimedia.org/T109286) [15:11:23] I don't see anything out of the ordinary in labs-ns2 pdns logs around that time btw [15:11:58] PROBLEM - Auth DNS for labs pdns on labs-ns3.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:12:08] I don't entirely grok the labs dns setup yet but I put in a call to andrewbogott just now [15:13:18] PROBLEM - Apache HTTP on mw1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.179 second response time [15:13:28] PROBLEM - Apache HTTP on mw1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:13:38] PROBLEM - HHVM rendering on mw1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:13:58] RECOVERY - Auth DNS for labs pdns on labs-ns3.wikimedia.org is OK: DNS OK: 0.022 seconds response time.
nagiostest.eqiad.wmflabs returns [15:14:03] (03CR) 10BBlack: [C: 031] ulsfo: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266230 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [15:14:34] !log restart of pdns and pdns-recursor on labservices1001 [15:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:28] RECOVERY - Apache HTTP on mw1033 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.031 second response time [15:15:29] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.120 second response time [15:15:47] RECOVERY - HHVM rendering on mw1085 is OK: HTTP OK: HTTP/1.1 200 OK - 69584 bytes in 2.387 second response time [15:16:27] PROBLEM - Host mw1104 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:00] (03PS1) 10Giuseppe Lavagetto: deployment: allow fixing the git remote repository [puppet] - 10https://gerrit.wikimedia.org/r/266231 [15:17:28] RECOVERY - Host mw1104 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [15:17:28] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:38] (03CR) 10Ema: [C: 032 V: 032] ulsfo: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266230 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [15:19:28] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.135 second response time [15:19:58] <_joe_> apergos: https://gerrit.wikimedia.org/r/266231 this should do the trick I think [15:20:08] I was just looking [15:21:05] so what about them submodules? [15:21:14] I have no idea how that crap works btw. zero [15:21:38] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:38] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 second response time [15:23:56] _joe_: where is 'server' used in _fix_remote()? [15:24:11] <_joe_> apergos: uhm, it's not used? [15:24:17] <_joe_> sorry, doing another thing atm [15:24:37] PROBLEM - Host mw1028 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:02] why is it passed in as an arg? (whenever you get back to this) [15:25:37] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: puppet fail [15:25:47] RECOVERY - Host mw1028 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [15:26:07] PROBLEM - puppet last run on mw2068 is CRITICAL: CRITICAL: puppet fail [15:26:09] <_joe_> apergos: because it was used in a preceding version [15:29:44] (03PS2) 10Giuseppe Lavagetto: deployment: allow fixing the git remote repository [puppet] - 10https://gerrit.wikimedia.org/r/266231 [15:29:49] I wonder if you will want to runas=deploy_user and set a umask too [15:29:50] <_joe_> apergos: ^^ [15:29:53] _joe_: [15:30:00] <_joe_> apergos: for changing a remote? [15:30:01] <_joe_> no [15:30:31] if they are left owned by the current owner then it's fine [15:30:36] I don't know if that's the case [15:30:36] <_joe_> apergos: the git dir is owned by root anyways afaics [15:30:39] (03CR) 10jenkins-bot: [V: 04-1] deployment: allow fixing the git remote repository [puppet] - 10https://gerrit.wikimedia.org/r/266231 (owner: 10Giuseppe Lavagetto) [15:31:20] <_joe_> apergos: also, who else should ever change the remote repo? 
[15:31:40] no one, but other things in the config might be changed by the git deploy script [15:31:48] I don't rule it out, anyways [15:31:52] <_joe_> I mean all commands for git operations in deploy are not run as runas [15:31:59] <_joe_> look at the code [15:32:25] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1172, mw1217, mw1178, mw1257 are unresponsive, mgmt interface unreachable - https://phabricator.wikimedia.org/T124642#1961835 (10jcrespo) [15:32:37] <_joe_> apergos: I stand corrected: clone does [15:32:53] yeah I was looking at that in fact [15:33:17] <_joe_> let's do that [15:34:31] okey dokey [15:34:47] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.04% of data above the critical threshold [100000000.0] [15:34:49] <_joe_> apergos: nope actually it gets done only on the deployment server [15:34:54] <_joe_> not on the target [15:35:00] <_joe_> look at what _fetch_location does [15:35:22] !log Starting migration of mobile traffic to text cluster in ulsfo https://phabricator.wikimedia.org/T109286 [15:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:25] <_joe_> so I got it right [15:36:31] all right, I'll buy that [15:37:49] <_joe_> apergos: give me a review :) [15:37:52] do we want it to exit out on the first [15:38:05] give me a sec to finish typing :-P [15:38:05] <_joe_> if it gives an error, yes [15:38:21] on the first error as opposed to logging it and continuing on? [15:38:22] <_joe_> apergos: I mean on gerrit, with your time :P [15:38:31] <_joe_> yeah I'd exit frankly [15:38:57] (03PS1) 10Jcrespo: Exclude mw1217, mw1178 and mw1257 from SCAP until they are brought up [puppet] - 10https://gerrit.wikimedia.org/r/266238 (https://phabricator.wikimedia.org/T124642) [15:40:06] ^I just need someone with better eyes to +1 [15:41:52] _joe_: wanna fix pep8's whine? [15:42:10] <_joe_> apergos: yeah I will do [15:43:50] (03PS3) 10Giuseppe Lavagetto: deployment: allow fixing the git remote repository [puppet] - 10https://gerrit.wikimedia.org/r/266231 [15:44:38] Hello. Any shell access holder: Could you please run "updateArticleCount.php" for jawp? [15:46:22] <_joe_> apergos: I'm gonna merge this, I'm pretty confident it will work. I can use it on netmon1001 and test the various repos there [15:47:10] _joe_: you have deploy_user set in there but you don't use it now [15:47:20] <_joe_> oh meh [15:47:27] rxy: if no one does it now, you might want to file a Phabricator task requesting it, in Wikimedia-Site-Requests project :) (i don't have the access) [15:47:43] rxy: i think that script is supposed to run every month, by the way [15:47:59] (03PS4) 10Giuseppe Lavagetto: deployment: allow fixing the git remote repository [puppet] - 10https://gerrit.wikimedia.org/r/266231 [15:48:39] (03CR) 10ArielGlenn: [C: 031] deployment: allow fixing the git remote repository [puppet] - 10https://gerrit.wikimedia.org/r/266231 (owner: 10Giuseppe Lavagetto) [15:48:59] (03PS5) 10Giuseppe Lavagetto: deployment: allow fixing the git remote repository [puppet] - 10https://gerrit.wikimedia.org/r/266231 [15:48:59] a +1 with no comments so no one knows what it means :-P [15:49:10] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] deployment: allow fixing the git remote repository [puppet] - 10https://gerrit.wikimedia.org/r/266231 (owner: 10Giuseppe Lavagetto) [15:49:48] MatmaRex: hmm. okay. Thanks for your advice.
:) [15:52:36] (03PS2) 10Jcrespo: Exclude mw1217, mw1178 and mw1257 from SCAP until they are brought up [puppet] - 10https://gerrit.wikimedia.org/r/266238 (https://phabricator.wikimedia.org/T124642) [15:53:47] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:54:17] RECOVERY - puppet last run on mw2068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:19] (03CR) 10Jcrespo: [C: 032] "Nodes depooled already." [puppet] - 10https://gerrit.wikimedia.org/r/266238 (https://phabricator.wikimedia.org/T124642) (owner: 10Jcrespo) [15:56:12] 6operations, 10Wikimedia-Site-Requests: Rename cbk-zamwiki to cbkwiki - https://phabricator.wikimedia.org/T124657#1961896 (10Aklapper) [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160125T1600). Please do the needful. [16:00:04] James_F: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:58] RECOVERY - salt-minion processes on sca1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:01:09] RECOVERY - salt-minion processes on sca1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:01:31] (03PS1) 10Jcrespo: Preparing ips for new parsercache deployments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266242 [16:02:10] pc1004 seem to be working really nicely [16:02:26] 6operations, 10ops-codfw: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#1961906 (10Papaul) a:3Papaul [16:02:27] is this the new one? [16:02:33] James_F|Away: I can SWAT for you when you're around [16:03:14] (03PS2) 10Jcrespo: Preparing ips for new parsercache deployments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266242 [16:05:27] (03CR) 10Jcrespo: [C: 032] Preparing ips for new parsercache deployments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266242 (owner: 10Jcrespo) [16:05:57] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2039.codfw.wmnet because of too many down!: rendering_80 - Could not depool server mw2087.codfw.wmnet because of too many down! [16:07:45] 6operations, 10ops-codfw: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#1961913 (10Papaul) Unfortunately this system is out of warranty like T120073 and T117848 [16:08:08] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [16:09:48] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Preparing ips for new parsercache deployments (duration: 03m 32s) [16:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:08] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [16:18:09] James_F: if you want to get the VE mediawiki/config patches out for SWAT, let me know. [16:20:24] (03PS1) 10Giuseppe Lavagetto: git-deploy: fixup for Ied4bcde607 [puppet] - 10https://gerrit.wikimedia.org/r/266246 [16:20:34] 6operations, 10ops-codfw, 5Patch-For-Review: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1961963 (10Papaul) @RobH I talked to Dell for a replacement drive for this system. 
According to them this system has only a basic support and not a pro-... [16:20:46] (03PS2) 10Giuseppe Lavagetto: git-deploy: fixup for Ied4bcde607 [puppet] - 10https://gerrit.wikimedia.org/r/266246 [16:21:03] (03CR) 10Giuseppe Lavagetto: [C: 032] git-deploy: fixup for Ied4bcde607 [puppet] - 10https://gerrit.wikimedia.org/r/266246 (owner: 10Giuseppe Lavagetto) [16:21:24] (03CR) 10Giuseppe Lavagetto: [V: 032] git-deploy: fixup for Ied4bcde607 [puppet] - 10https://gerrit.wikimedia.org/r/266246 (owner: 10Giuseppe Lavagetto) [16:22:17] PROBLEM - mediawiki-installation DSH group on mw1157 is CRITICAL: Host mw1157 is not in mediawiki-installation dsh group [16:22:38] <_joe_> jynus: ^^ ? [16:23:23] thcipriani: Sorry, yeah, scrubbed due to wmf.10 [16:23:28] yeah, mmm [16:23:39] James_F: kk, np. [16:23:47] that is a mistake [16:24:01] <_joe_> !log running salt deploy.fixurl on all deployment targets [16:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:55] 6operations, 10ops-codfw: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1961981 (10RobH) [16:25:18] <_joe_> !log restarting salt-minion on all deployment targets [16:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:02] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Preparing ips for new parsercache deployments (second try after running puppet) (duration: 03m 23s) [16:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:07] 6operations, 5codfw-rollout, 3codfw-rollout-Apr-Jun-2015: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#1961984 (10Papaul) [16:26:09] 6operations, 10ops-codfw: mw2098 non-responsive to mgmt - https://phabricator.wikimedia.org/T85286#1961982 (10Papaul) 5Open>3Resolved Close this. This system is back up. [16:27:35] (03PS1) 10Jcrespo: Fixing bug in wrong server being excluded from dsh (mw1257, not mw1157) [puppet] - 10https://gerrit.wikimedia.org/r/266250 [16:29:36] I do not trust myself, can someone review 266250, compared to T124642/reality before 17h? (deployment time) [16:30:28] (03PS2) 10Jcrespo: Fixing bug in wrong server being excluded from dsh (mw1257, not mw1157) [puppet] - 10https://gerrit.wikimedia.org/r/266250 (https://phabricator.wikimedia.org/T124642) [16:32:50] (03CR) 10ArielGlenn: [C: 031] Fixing bug in wrong server being excluded from dsh (mw1257, not mw1157) [puppet] - 10https://gerrit.wikimedia.org/r/266250 (https://phabricator.wikimedia.org/T124642) (owner: 10Jcrespo) [16:33:25] (03CR) 10Jcrespo: [C: 032] "Thanks, Ariel" [puppet] - 10https://gerrit.wikimedia.org/r/266250 (https://phabricator.wikimedia.org/T124642) (owner: 10Jcrespo) [16:34:21] ;-) [16:37:18] <_joe_> is swat going to happen today? [16:38:19] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Preparing ips for new parsercache deployments (third try) (duration: 01m 35s) [16:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:49] hey, I got it right on the third try! [16:39:39] <_joe_> jynus: at least mira works well, as far as scap is concerned [16:39:39] (I would have gotten the error there before icinga if I wasn't trying to do n things at the same time) [16:40:16] _joe_: only patches that were up for SWAT were scrubbed due to the wmf.11 rollback. So nothing to SWAT today.
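(Context for the dsh patches just merged: a dsh group is a plain file listing one target host per line, so excluding a broken appserver from scap deploys amounts to dropping its line from the group file — which is also why a one-digit slip like mw1157-for-mw1257 is easy to make, and why the Icinga DSH-group check above flagged it. A minimal, hypothetical Puppet sketch of the idea only — not the actual scap/dsh manifests in operations/puppet:)

```
# Hypothetical sketch; requires puppetlabs-stdlib for difference() and
# join(). Hosts down for hardware work (cf. T124642) are filtered out
# of the rendered dsh group file.
$all_hosts = ['mw1090.eqiad.wmnet', 'mw1157.eqiad.wmnet', 'mw1178.eqiad.wmnet', 'mw1217.eqiad.wmnet', 'mw1257.eqiad.wmnet']
$excluded  = ['mw1178.eqiad.wmnet', 'mw1217.eqiad.wmnet', 'mw1257.eqiad.wmnet']

file { '/etc/dsh/group/mediawiki-installation':
    ensure  => file,
    content => join(difference($all_hosts, $excluded), "\n"),
}
```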
[16:40:19] happy new year jynus :) [16:40:43] ori, would you be able to take a look at those puppet patches today? [16:40:44] <_joe_> thcipriani: ok, jynus has been using it without issues [16:40:49] they have +1s already. [16:41:03] yep, all problematic servers undshized [16:41:04] <_joe_> subbu: hi :) when do you plan to deploy parsoid this week? [16:41:14] thanks, mafk [16:41:17] hi .. no plans today. [16:41:21] <_joe_> If it's not the deep of the night, I'll try to be around [16:41:33] it could have been worse [16:41:36] <_joe_> subbu: even in the next days, we should check that trebuchet works as expected [16:41:45] I could have depooled the wrong servers :-/ [16:41:49] _joe_, what is this about? [16:42:18] <_joe_> subbu: we switched the deployment server, see my mails in ops@ [16:42:43] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#1962060 (10faidon) 3NEW [16:44:33] <_joe_> subbu: anyways, scap can be deployed correctly, you shouldn't have any issues [16:44:45] _joe_, ah, ok. got it. [16:44:45] but, no parsoid deploy today. possibly wednesday, unsure. [16:45:23] <_joe_> subbu: in case, try to ping me :) [16:45:41] _joe_, ok. will ping you before we attempt parsoid deploys, yes. [16:47:15] (03PS1) 10Ema: ulsfo: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266253 (https://phabricator.wikimedia.org/T109286) [16:49:12] !log Finished migration of mobile traffic to text cluster in ulsfo https://phabricator.wikimedia.org/T109286 [16:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:59] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1962078 (10Joe) 3NEW [16:55:47] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Check the redis configuration in codfw - https://phabricator.wikimedia.org/T124672#1962087 (10Joe) 3NEW [16:56:49] (03CR) 10BBlack: [C: 031] ulsfo: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266253 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [17:00:01] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#1962096 (10Joe) 3NEW [17:01:24] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1962104 (10Joe) [17:01:49] 6operations, 6Analytics-Kanban, 7HTTPS, 5Patch-For-Review: EventLogging sees too few distinct client IPs {oryx} [8 pts] - https://phabricator.wikimedia.org/T119144#1962106 (10Nuria) 5Open>3Resolved [17:02:15] (03CR) 10Ema: [C: 032 V: 032] ulsfo: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266253 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [17:02:23] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#1962112 (10Joe) [17:23:28] RECOVERY - mediawiki-installation DSH group on mw1157 is OK: OK [17:24:33] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: mail from testlabs to ops list - https://phabricator.wikimedia.org/T124516#1962167 (10Dzahn) also let's stop using ops@ wherever we see it and use ops@lists , the proper list address [17:34:08]
6operations: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1962208 (10Dzahn) a:3elukey [17:38:12] 6operations, 10RESTBase: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1962225 (10Eevans) @gwicke: Trying to pin this down a bit more, are the "marking node X as down" messages in question limited those that have been filtered in the dashboard?... [17:40:55] 6operations: Integrate jessie 8.3 point update - https://phabricator.wikimedia.org/T124647#1962233 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [17:42:57] mw1019 seems to be logging quite a few "Lost parent, LightProcess exiting" HHVM errors over the last 6 hours or so. [17:43:08] They look to come in bunches [17:43:46] <_joe_> bd808: that's hhvm crashing, I guess [17:43:56] <_joe_> that's what causes those logs [17:45:02] Over the last 12h there are 13k crashes on mw1019 then :/ [17:45:24] and about 40 across the rest of the server fleet in the same timewindow [17:47:05] (03PS1) 10Elukey: Add the moving average function to the event logging's insert rate alarming metric. Bug: T124204 [puppet] - 10https://gerrit.wikimedia.org/r/266264 (https://phabricator.wikimedia.org/T124204) [17:47:52] <_joe_> bd808: it was a "red herring" ;) [17:50:03] 6operations: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1962273 (10Dzahn) was mentioned briefly in ops meeting. Faidon said this is used less and less over time and we can probably shut it down, but some networking work needs to be done first. adding #netops for that. [17:50:16] 6operations, 10netops: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1962276 (10Dzahn) [17:55:37] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [17:57:44] (03PS2) 10Elukey: Add the moving average function to the event logging's insert rate alarming metric. Bug: T124204 [puppet] - 10https://gerrit.wikimedia.org/r/266264 (https://phabricator.wikimedia.org/T124204) [17:58:41] (03PS5) 10Dzahn: admin: add dc-ops to install-server, allow puppet agent -t -v [puppet] - 10https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) [17:58:48] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1962303 (10RobH) Please note that the request as stated above (as of the date of this comment) were reviewed and approved in the operations meeting. (I'm merely... [17:58:59] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add datacenter-ops to dhcp /install-server and allow to run puppet commands - https://phabricator.wikimedia.org/T123681#1962313 (10RobH) Please note that the request as stated above (as of the date of this comment) were reviewed and approved in the opera... [17:59:10] (03CR) 10Dzahn: [C: 032] "approved in meeting" [puppet] - 10https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) (owner: 10Dzahn) [18:01:23] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: mail from testlabs to ops list - https://phabricator.wikimedia.org/T124516#1962320 (10Andrew) > ops@lists @dzahn, can you clarify? What is the full email address I should be using here? [18:01:45] <_joe_> thcipriani: is the meeting off? 
I'm ok with it btw [18:02:01] <_joe_> *very* ok [18:02:02] _joe_: coming, RelEng meeting running long [18:02:07] <_joe_> ah, damn [18:02:09] <_joe_> :P [18:04:28] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: mail from testlabs to ops list - https://phabricator.wikimedia.org/T124516#1962345 (10Dzahn) >>! In T124516#1962320, @Andrew wrote: > @dzahn, can you clarify? What is the full email address I should be using here? ops@lists.wikimedia.org plea... [18:06:36] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1962360 (10RobH) [18:07:21] 6operations: labtestservices2001.wikimedia.org.crt - https://phabricator.wikimedia.org/T124374#1962373 (10Andrew) a:5MoritzMuehlenhoff>3Andrew [18:07:49] (03PS3) 10Elukey: Add the moving average function to the event logging's insert rate alarming metric. Bug: T124204 [puppet] - 10https://gerrit.wikimedia.org/r/266264 (https://phabricator.wikimedia.org/T124204) [18:09:42] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: mail from testlabs to ops list - https://phabricator.wikimedia.org/T124516#1962395 (10Andrew) a:3Andrew [18:10:16] 6operations, 10Architecture, 10DBA: Architecture decision to solve the need larger serves (for better capacity and consolidation) vs. more, smaller servers (for high availability) - https://phabricator.wikimedia.org/T124681#1962396 (10jcrespo) [18:12:56] 6operations, 10Architecture, 10DBA: Architecture decision to solve the need larger serves (for better capacity and consolidation) vs. more, smaller servers (for high availability) - https://phabricator.wikimedia.org/T124681#1962409 (10jcrespo) [18:14:12] 6operations, 10Architecture, 10DBA: Architecture decision to solve the need larger serves (for better capacity and consolidation) vs. more, smaller servers (for high availability) - https://phabricator.wikimedia.org/T124681#1962388 (10jcrespo) CCing @mark so maybe he can assign someone to discuss this with.... [18:15:07] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [18:17:10] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1962423 (10RobH) I cannot connect to this machine via ssh or to its DRAC connection. I need to do this to verify its drive info. @Papaul: Please adv... [18:17:37] (03CR) 10Ottomata: [C: 031] "Looks good to me, mforns?" [puppet] - 10https://gerrit.wikimedia.org/r/266264 (https://phabricator.wikimedia.org/T124204) (owner: 10Elukey) [18:17:56] 6operations, 10RESTBase: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1962426 (10GWicke) @Eevans, those are only those that typically show up in production. There is another set for staging. [18:19:15] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add datacenter-ops to dhcp /install-server and allow to run puppet commands - https://phabricator.wikimedia.org/T123681#1962433 (10Dzahn) 5Open>3Resolved a:3Dzahn on carbon: Notice: /Stage[main]/Admin/Admin::Hashuser[pt1979]/Admin::User[pt1979]/Fi... 
[18:19:25] 10Ops-Access-Requests, 6operations: add datacenter-ops to dhcp /install-server and allow to run puppet commands - https://phabricator.wikimedia.org/T123681#1962436 (10Dzahn) [18:19:37] 6operations, 10Architecture, 10DBA: Architecture decision to solve the need larger serves (for better capacity and consolidation) vs. more, smaller servers (for high availability) - https://phabricator.wikimedia.org/T124681#1962437 (10jcrespo) [18:20:08] 6operations, 10Architecture, 10DBA: Architecture decision to solve the need larger serves (for better capacity and consolidation) vs. more, smaller servers (for high availability) - https://phabricator.wikimedia.org/T124681#1962388 (10jcrespo) [18:21:25] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:25] 6operations, 10Wiki-Loves-Monuments-General, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1962449 (10Dzahn) >>! In T118468#1958689, @JanZerebecki wrote: > I guess, to transfer WMF legal needs to be willing to host all WLM... [18:21:37] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [24.0] [18:23:35] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.020 second response time [18:24:04] subbu: sure, I'll take a look now [18:24:29] thanks. [18:28:37] !log retroactively logging the depool of mw1217, mw1178 and mw1257 3 hours ago (Jan 25 15:45:26) [18:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:29:37] subbu: per https://wikitech.wikimedia.org/wiki/Puppet_coding#Classes , class and resource names should not have dashes; use underscores instead. Dashes can trip up Puppet's parser, which has a hard time distinguishing them from the minus operator in certain edge cases. [18:30:09] ori, mutante fixed it up in a later patch. [18:30:53] it is the last patch in the chain. [18:30:54] oh yeah, I see it now [18:30:58] great! 
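(To make ori's naming rule concrete — a minimal illustration using the same rename pattern as the patches discussed here, e.g. parsoid-rt-client becoming parsoid_rt_client; this snippet is hypothetical, not the actual role code:)

```
# A dashed name like parsoid-rt-client is ambiguous: in expression
# contexts the parser can read it as 'parsoid' minus 'rt' minus 'client'.
# Underscores avoid the ambiguity entirely:
class parsoid_rt_client {
    # resources for the rt-testing client go here
}

include parsoid_rt_client
```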
[18:31:33] you know how it is with poorly-specified DSLs that grow organically ;) [18:31:38] ori: it's the very last one in the repo :) [18:31:43] then i will change global puppet-lint.rc [18:31:47] and they will never come back [18:31:50] \o/ [18:32:34] (03PS15) 10Ori.livneh: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [18:32:41] (03CR) 10Ori.livneh: [C: 032 V: 032] Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [18:33:04] (03PS7) 10Ori.livneh: Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [18:33:12] (03CR) 10Ori.livneh: [C: 032 V: 032] Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [18:33:24] (03PS3) 10Ori.livneh: Clone the 'ruthenium' branch of testreduce and visualdiff [puppet] - 10https://gerrit.wikimedia.org/r/265856 (owner: 10Subramanya Sastry) [18:33:36] (03CR) 10Ori.livneh: [C: 032 V: 032] Clone the 'ruthenium' branch of testreduce and visualdiff [puppet] - 10https://gerrit.wikimedia.org/r/265856 (owner: 10Subramanya Sastry) [18:34:08] (03PS3) 10Ori.livneh: nginx conf that routes requests to different services on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/265863 (owner: 10Subramanya Sastry) [18:34:14] (03CR) 10Ori.livneh: [C: 032 V: 032] nginx conf that routes requests to different services on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/265863 (owner: 10Subramanya Sastry) [18:34:30] (03PS3) 10Ori.livneh: parsoid-testing: rename classes with dashes [puppet] - 10https://gerrit.wikimedia.org/r/265873 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [18:34:37] (03CR) 10Ori.livneh: [C: 032 V: 032] parsoid-testing: rename classes with dashes [puppet] - 10https://gerrit.wikimedia.org/r/265873 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [18:36:59] hurrah! mutante you or ori will have to run puppet on ruthenium. [18:37:07] subbu: already running it [18:37:12] (03PS5) 10Dzahn: puppet-lint: rm exceptions for dashes in class names [puppet] - 10https://gerrit.wikimedia.org/r/260201 (https://phabricator.wikimedia.org/T93645) [18:37:16] ah, great. [18:37:21] Error: Could not set 'file' on ensure: No such file or directory @ dir_s_rmdir - /usr/lib/parsoid/src/tests/testreduce/parsoid-rt-client.rttest.localsettings.js20160125-11854-1ikum5w.lock at 11:/etc/puppet/manifests/role/parsoid_rt_client.pp [18:37:22] 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1962529 (10Cmjohnson) a:3Cmjohnson Papaul I a going to take this until I send you the disks later this week. [18:37:26] Error: Could not set 'file' on ensure: No such file or directory @ dir_s_rmdir - /usr/lib/parsoid/src/tests/testreduce/parsoid-rt-client.rttest.localsettings.js20160125-11854-1ikum5w.lock at 11:/etc/puppet/manifests/role/parsoid_rt_client.pp [18:37:54] grr .. bad path. i missed a /deploy/ there .. /usr/lib/parsoid/deploy/ [18:38:12] subbu: can you submit a patch? [18:38:20] yes .. about to. 
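(Background for the exchange that follows: the parsoid tree on ruthenium is provisioned with the git::clone define from operations/puppet. A sketch of such a resource — parameter names quoted from memory of that define, so treat the details as illustrative. The submodule flag is the one that turns out to matter just below, since without it the src/ checkout under /usr/lib/parsoid/deploy stays empty:)

```
# Illustrative sketch of cloning the deploy repo with its submodules.
git::clone { 'mediawiki/services/parsoid/deploy':
    ensure             => 'latest',
    directory          => '/usr/lib/parsoid/deploy',
    branch             => 'master',
    recurse_submodules => true,
}
```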
[18:39:01] (03CR) 10Dzahn: [C: 032] "verifies now after the last ones with dashes have been renamed :)" [puppet] - 10https://gerrit.wikimedia.org/r/260201 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [18:39:07] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [18:40:29] (03PS1) 10Subramanya Sastry: Fix bad path in role::parsoid_rt_client definition [puppet] - 10https://gerrit.wikimedia.org/r/266276 [18:40:39] ori ^ [18:40:39] (03PS2) 10Ori.livneh: Fix bad path in role::parsoid_rt_client definition [puppet] - 10https://gerrit.wikimedia.org/r/266276 (owner: 10Subramanya Sastry) [18:40:46] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix bad path in role::parsoid_rt_client definition [puppet] - 10https://gerrit.wikimedia.org/r/266276 (owner: 10Subramanya Sastry) [18:42:15] subbu: still failing -- https://dpaste.de/0741/raw [18:43:00] will take a look. [18:43:27] ah .. submodule of /usr/lib/parsoid/deploy hasn't been checked out. [18:43:42] which is what populates src/ there .. [18:43:53] might require a parameter to git::repo [18:44:04] err, git::clone [18:44:09] yeah [18:44:13] you need recurse_submodules => true [18:47:02] (03PS1) 10Subramanya Sastry: Add recurse_submodues => true while cloning parsoid in parsoid::testing role [puppet] - 10https://gerrit.wikimedia.org/r/266277 [18:47:21] (03PS2) 10Ori.livneh: Add recurse_submodules => true while cloning parsoid in parsoid::testing role [puppet] - 10https://gerrit.wikimedia.org/r/266277 (owner: 10Subramanya Sastry) [18:48:22] (03PS1) 10JGirault: Bump portals to master (remove A/B/C test from production) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266278 [18:49:13] 10Ops-Access-Requests, 6operations: add datacenter-ops to dhcp /install-server and allow to run puppet commands - https://phabricator.wikimedia.org/T123681#1962593 (10Papaul) Okay test and i was able to login in. Thanks. [18:49:14] (03PS3) 10Ori.livneh: Add recurse_submodules => true while cloning parsoid in parsoid::testing role [puppet] - 10https://gerrit.wikimedia.org/r/266277 (owner: 10Subramanya Sastry) [18:49:37] (03CR) 10Ori.livneh: [C: 032 V: 032] Add recurse_submodules => true while cloning parsoid in parsoid::testing role [puppet] - 10https://gerrit.wikimedia.org/r/266277 (owner: 10Subramanya Sastry) [18:49:48] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0] [18:50:06] * ori re-runs puppet on ruthenium [18:50:45] * subbu has written more puppet code in the last 2 weeks than in his entire life [18:52:37] PROBLEM - puppet last run on mw2096 is CRITICAL: CRITICAL: Puppet has 1 failures [18:53:42] subbu: https://dpaste.de/wMpE/raw [18:54:23] let me know if you get stuck and i'll try to think through the issue with you; for now i'm just mindlessly passing the logs to you and waiting for fix-ups ;) [18:55:21] (03CR) 10RobH: [C: 031] "seems sane and indeed the vhost doesn't actually resolve. This is listed for puppetswat on 2016-01-26." [puppet] - 10https://gerrit.wikimedia.org/r/265548 (owner: 10Alex Monk) [18:55:53] ori, paths are incorrect .. now that parsoid is checked out .. i'll go look at ruthenium and paths and fix them up. :) [18:56:37] my earlier adding /deploy/ is broken. but, i see that path used elsewhere too. so, will verify paths and fix. [18:56:57] (03CR) 10RobH: [C: 031] "This is slated for puppetswat per the deployments page on 2016-01-26." 
[puppet] - 10https://gerrit.wikimedia.org/r/265642 (owner: 10Alex Monk) [18:57:41] cmjohnson1: so the beta site reviews are easy cuz well... if beta breaks and we have to rollback during puppet swat its beta [18:57:45] its kind of what its for ;] [18:57:55] (03PS2) 10JGirault: Bump portals to master (remove A/B/C test from production) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266278 (https://phabricator.wikimedia.org/T124245) [18:57:58] k [18:58:01] its also why im not complaining to Krenair about these lacking phab tasks ;] [18:58:07] cuz beta. [18:58:08] !log removed unused wikiversions.cdb on mira and tin [18:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:28] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [18:58:55] cmjohnson1: you'll notice that ebernhardson listed an access group change to puppet swat [18:59:01] which is not really a puppetswat thing [18:59:22] ebernhardson: so im on clinic duty anyhow so i'm not ditching your patchset into the ether, but it likely wont stay on puppetswat [18:59:29] we try to keep access changes out of swats [18:59:45] (its detailed on our puppetswat page on wikitech) so i'll likely push it along outside of swat [18:59:58] moving that to access requests? or gen ops? [19:00:10] its already in access requests, im just going to pull it off the puppetswat schedule [19:00:32] in the copying of the schedule, somehow the link to the puppetswat wikitech page keeps getting stripped out [19:00:41] i've added back to schedule. [19:01:54] (03PS1) 10Subramanya Sastry: ruthenium services: /usr/lib/parsoid/deploy/src => /usr/lib/parsoid/src [puppet] - 10https://gerrit.wikimedia.org/r/266280 [19:01:56] access requests are the clinic person, and just cuz im doing both clinic and puppetswat doesn't mean we should let clinic patches into puppet swat =] [19:02:16] (03CR) 10Ori.livneh: [C: 032 V: 032] ruthenium services: /usr/lib/parsoid/deploy/src => /usr/lib/parsoid/src [puppet] - 10https://gerrit.wikimedia.org/r/266280 (owner: 10Subramanya Sastry) [19:04:26] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1962678 (10RobH) @EBernhardson, please note that I've removed the merging of the new group rights off puppetswat. Puppetswat shouldn't include any new access ri... [19:04:31] !log nfsd has 224 threads atm and was bumped up over the weekend [19:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:45] !log the nfsd thread change is on labstore1001 [19:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:56] (03CR) 10RobH: [C: 031] "This seems right to me, and the overall concept was approved today in our operations meeting. However, I'd like a second opsen to review " [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [19:05:06] !log labstore1001 temp change to CFQ scheduler on 01/22/2015 [19:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:45] (03CR) 10Mforns: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/266264 (https://phabricator.wikimedia.org/T124204) (owner: 10Elukey) [19:06:56] (03CR) 10RobH: [C: 031] "Please note this is slated for puppetswat on 2016-01-26.
As the commit message points out, this CANNOT be merged without the relevant upd" [puppet] - 10https://gerrit.wikimedia.org/r/265942 (owner: 10EBernhardson) [19:07:44] subbu: there are two issues with the systemd unit files -- [19:07:55] subbu: first is: Jan 25 19:06:56 ruthenium systemd[1]: [/lib/systemd/system/parsoid-rt.service:12] Executable path is not absolute, ignoring: node server.js --conf...ttings.js [19:08:12] subbu: should be /usr/bin/nodejs instead [19:08:26] the second is Jan 25 19:03:47 ruthenium systemd[1]: [/lib/systemd/system/parsoid-rt.service:8] Failed to parse resource value, ignoring: 10K [19:08:41] I'm not sure why it's invalid -- the K suffix should be fine, according to http://www.freedesktop.org/software/systemd/man/systemd.exec.html#LimitCPU= [19:08:58] oh no [19:08:59] right, i thought so .. will change that to 10000 then. [19:09:00] it's not fine [19:09:01] The multiplicative suffixes K (=1024), M (=1024*1024) and so on for G, T, P and E may be used for resource limits measured in bytes (e.g. LimitAS=16G). [19:09:07] *measured in bytes* [19:09:13] ah, nm. [19:09:23] will fix both. [19:09:26] thanks [19:10:30] !log Live hacking on mw1017 to debug 1.27.0-wmf.11 issues. All wikis there currently set to use 1.27.0-wmf.11. [19:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:44] greg-g: ^ fyi [19:12:08] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0] [19:12:35] chasemp: can you ack the alerts? [19:13:18] yes that's mostly useless in the context of labstore1003 especially [19:13:20] * greg-g nods [19:13:36] (03PS1) 10Subramanya Sastry: ruthenium services: Fix errors in systemd files [puppet] - 10https://gerrit.wikimedia.org/r/266282 [19:14:39] ori ^ [19:14:49] yep, already reviewing [19:14:53] (03CR) 10Ori.livneh: [C: 032] ruthenium services: Fix errors in systemd files [puppet] - 10https://gerrit.wikimedia.org/r/266282 (owner: 10Subramanya Sastry) [19:15:25] you are faster than me. :) [19:15:40] just more distractible [19:16:36] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1962800 (10Dzahn) @ssastry all the data has been copied back from osmium. i synced /home/ too and copied old data back from before the upgr... [19:17:46] subbu: allllllmost -- most things passed, but these errors remain: https://dpaste.de/PMom/raw [19:18:57] RECOVERY - puppet last run on mw2096 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:19:00] looking. [19:19:01] does it expect that config file to be in the private repo because of a secret in it? [19:19:05] bd808: just to be clear, that's for the auto-relogin issue? [19:19:09] (and has a placeholder in labs/private) [19:19:18] greg-g: yes [19:19:22] robh: thats fair, i wasn't sure where it belonged but figured puppetswat was a good way to at least get someone to look at it [19:19:37] ori: ty [19:20:15] greg-g: right now anomie has been working on https://phabricator.wikimedia.org/T124252 [19:20:29] ebernhardson: no worries, I +1'd it and if no other opsens take a look by tomorrow i'll hunt someone down [19:20:37] and i'll merge outside of puppetswat as clinic duty =] [19:20:41] He has a patch up for review now actually [19:21:05] * greg-g nods [19:21:07] subbu: there's no parsoid-vd-client.config.js in /modules/testreduce/files [19:21:34] paths issue ..
i need to move files around. [19:21:53] i have 2 files in /visualdiff/ instead of in /testreduce/ ... [19:22:10] i have a team meeting in 8 minutes, if you can get a patch up before that i'll merge it and run puppet [19:22:57] (03PS1) 10Dzahn: ruthenium: remove rsyncd, was for upgrade only [puppet] - 10https://gerrit.wikimedia.org/r/266289 [19:24:04] (03PS2) 10Dzahn: ruthenium: remove rsyncd, was for upgrade only [puppet] - 10https://gerrit.wikimedia.org/r/266289 (https://phabricator.wikimedia.org/T122328) [19:24:40] (03PS1) 10Subramanya Sastry: ruthenium services: Move config files to modules/testreduce/ [puppet] - 10https://gerrit.wikimedia.org/r/266290 [19:24:42] there you go [19:24:48] * ori looks [19:25:00] * subbu crosses his fingers [19:25:04] (03CR) 10Ori.livneh: [C: 032 V: 032] ruthenium services: Move config files to modules/testreduce/ [puppet] - 10https://gerrit.wikimedia.org/r/266290 (owner: 10Subramanya Sastry) [19:28:00] (03PS3) 10Dzahn: ruthenium: remove rsyncd, was for upgrade only [puppet] - 10https://gerrit.wikimedia.org/r/266289 (https://phabricator.wikimedia.org/T122328) [19:29:28] subbu: good news and bad news [19:29:37] good news, puppet runs to completion successfully [19:30:16] bad news, the services die shortly after being launched. i don't have enough lines of log context in service status (and no logs via journalctl -u for some reason), but it looks like a failed import [19:30:17] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [19:30:19] Jan 25 19:26:31 ruthenium nodejs[30965]: at Module.require (module.js:366:17) [19:30:31] ori, thanks .. will take a look. [19:30:34] can't see lines above that [19:30:46] probably a node module path issue? [19:30:57] i can try starting them manually .. could be. [19:31:09] (03PS4) 10EBernhardson: Add alert for elasticsearch 50th percentile prefix search time [puppet] - 10https://gerrit.wikimedia.org/r/265942 (https://phabricator.wikimedia.org/T124542) [19:31:22] * subbu has to look up the systemd incantations first [19:32:15] !log eqiad: removing static routes for 6to4/Teredo to nitrogen (decommissioning our own relays) [19:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:27] ah, i cannot .. no permissions. mutante what is involved in get sudo perms to start/stop services on ruthenium and look at logs? [19:33:00] (03PS1) 10Eevans: match restbase config to current Cassandra cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/266291 (https://phabricator.wikimedia.org/T123869) [19:34:23] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1962960 (10faidon) p:5Triage>3Normal [19:35:13] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1936607 (10faidon) I removed the static routes from cr1/cr2-eqiad for the 6to4 and Teredo routes. Nitrogen shouldn't be used anymore and can be decommissioned (I adjusted the task description). [19:39:26] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963001 (10Dzahn) a:3Dzahn cool, thanks Faidon. i'll take the decom part [19:39:30] 6operations, 10RESTBase, 5Patch-For-Review: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1963000 (10Eevans) OK, the problem seems to be with `restbase::seeds`, what RESTBase passes to the Cassandra driver. 
That list is currently out of date... [19:39:36] (03CR) 10GWicke: [C: 04-1] "As discussed on the task, lets first test this in staging." [puppet] - 10https://gerrit.wikimedia.org/r/266291 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [19:40:02] ori, mutante ah .. looks like /srv/testreduce is still on the master branch ... (probably because it had already been checked out in an older puppet run)? .. so, node_modules haven't been checked out [19:40:15] ori, i know you are in a meeting. so, this is for later. [19:40:54] but /srv/visualdiff is on ruthenium branch and has node_modules ... i imagine because this came from patches today? anyway, just guessing. [19:41:45] subbu: do you need a fix by a root user or something urgent? [19:42:06] it is not urgent. [19:42:13] alright [19:42:31] i'm out of the meeting [19:42:33] i can fix it [19:42:57] 6operations, 10RESTBase, 5Patch-For-Review: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1963016 (10Eevans) > IIRC the list of seeds is also used to set up the cassandra firewall, so we need to be careful about how removing main node IPs for... [19:44:34] okay, I moved the whole of /usr/lib/parsoid out of the way so that it gets recreated by puppet [19:45:01] (03PS2) 10Eevans: match restbase config to current Cassandra cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/266291 (https://phabricator.wikimedia.org/T123869) [19:45:19] ori, /srv/testreduce [19:45:37] that is the one. [19:46:08] ah, gotcha [19:46:34] 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1963049 (10EBernhardson) If my math is right, a 100% increase in 12 months extrapolated to 18 months gives current capacity = 1 increase in 12 months: 2x increase in following 6 months: 1.... [19:49:11] 6operations: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1963069 (10Dzahn) @elukey per chat in the ops meeting. ask me anytime about where to start to get a serial console, use installserver etc. [19:49:53] (03CR) 10Dzahn: [C: 032] ruthenium: remove rsyncd, was for upgrade only [puppet] - 10https://gerrit.wikimedia.org/r/266289 (https://phabricator.wikimedia.org/T122328) (owner: 10Dzahn) [19:50:23] (03Abandoned) 10Eevans: match restbase config to current Cassandra cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/266291 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [19:50:30] subbu: WorkingDirectory=/usr/lib/testreduce/client in /lib/systemd/system/parsoid-vd-client.service [19:50:34] subbu: should be /srv, no? [19:51:37] ya. [19:51:45] fixing. 
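(Pulling together the systemd fixes from this stretch of debugging: ExecStart needs an absolute path to the interpreter, LimitNOFILE takes a plain integer because the K/M/G suffixes are only parsed for byte-sized limits such as LimitAS, and WorkingDirectory has to match where the code actually lives under /srv. A minimal sketch of shipping such a unit from Puppet — paths approximated from the log, not the real module layout:)

```
# Sketch only: unit content approximated from the fixes discussed above.
file { '/lib/systemd/system/parsoid-rt.service':
    ensure  => file,
    content => "[Unit]
Description=Parsoid round-trip testing server

[Service]
ExecStart=/usr/bin/nodejs server.js -c /etc/testreduce/parsoid-rt.settings.js
WorkingDirectory=/srv/testreduce/server
LimitNOFILE=10000

[Install]
WantedBy=multi-user.target
",
    notify  => Exec['parsoid-rt-daemon-reload'],
}

# systemd only re-reads unit files after a daemon-reload (a point that
# resurfaces a little later in this log).
exec { 'parsoid-rt-daemon-reload':
    command     => '/bin/systemctl daemon-reload',
    refreshonly => true,
}
```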
[19:52:12] i ran puppet and stopped rsyncd on ruthenium [19:52:19] mutante: thanks [19:52:20] cleaning up what was just there for migration [19:52:28] (03PS1) 10Eevans: [production]: match restbase config to current Cassandra cluster Bug: T123869 [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) [19:53:17] (03PS1) 10Eevans: [staging]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266299 (https://phabricator.wikimedia.org/T123869) [19:53:34] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#1963095 (10jcrespo) [19:53:51] 6operations, 10DBA, 7Performance, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#1963095 (10jcrespo) [19:54:00] (03PS1) 10Subramanya Sastry: Fix working directory for testreduce clients on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/266300 [19:54:05] (03CR) 10Eevans: [C: 04-1] "We'll try this in staging first (https://gerrit.wikimedia.org/r/#/c/266299/)" [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [19:54:29] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix working directory for testreduce clients on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/266300 (owner: 10Subramanya Sastry) [19:55:10] <_joe_> ori: are you using trebuchet in any ways there? please let me know if you run into issues [19:55:18] (03PS2) 10Eevans: [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) [19:55:47] no trebuchet afaik [19:55:58] <_joe_> heh ok, I wanted more guinea pigs [19:56:30] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Setup circular replication (dallas -> eqiad) for databases - https://phabricator.wikimedia.org/T124698#1963123 (10jcrespo) 3NEW [19:57:34] 6operations, 10RESTBase, 5Patch-For-Review: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1963131 (10Eevans) I've separated this into two gerrits to better facilitate testing in staging, https://gerrit.wikimedia.org/r/266299 (cleanup `restbas... [19:57:38] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1963132 (10jcrespo) [19:57:43] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1963135 (10Dzahn) [19:57:46] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1963134 (10Dzahn) 5Open>3Resolved [19:58:40] Jan 25 19:03:18 ruthenium systemd[1]: parsoid.service lacks ExecStart setting. Refusing. [19:58:47] subbu: ^ [19:59:21] i guess live debugging is the puppet way .. looking. [20:00:32] subbu: one more place to fix paths: Jan 25 19:49:04 ruthenium nodejs[7940]: Aborting! 
Exception reading /etc/testreduce/parsoid-rt.settings.js: Error: Cannot find module '/usr/lib/te...stats.js' [20:00:45] [/lib/systemd/system/parsoid.service:8] Failed to parse resource value, [20:00:53] I see ExecStart=/usr/bin/nodejs bin/server.js -c /usr/lib/parsoid/src/localsettings.js in /lib/systemd/system/parsoid.service [20:01:02] ah. bug. i see. [20:01:19] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1963151 (10jcrespo) [20:01:34] 6operations, 6Parsing-Team, 10Parsoid, 6Services: parsoid roles: convert upstart to systemd - https://phabricator.wikimedia.org/T124480#1963152 (10Dzahn) These have been merged, so there are systemd files now, but currently we see this: ``` root@ruthenium:/etc/systemd# service parsoid status ● parsoid.se... [20:02:14] subbu: yeah; run: git grep 'usr/lib' -- modules/testreduce [20:02:24] hmm .. it doesn't like LimitNOFILE=10000? [20:02:45] i think the error mutante saw was because we hadn't run systemdctl daemon-reload [20:02:52] but the config path issue is real and current [20:03:04] *systemctl [20:03:50] subbu: i commented that line and it did not make a difference.. what ori said then i guess [20:04:49] ori: hmmm, tried.. but it's the same [20:04:58] i just do "service parsoid status" [20:05:26] let's fix that config path first [20:05:39] (03PS1) 10Subramanya Sastry: ruthenium: Fix bad paths in parsoid-rt-settings.js.erb [puppet] - 10https://gerrit.wikimedia.org/r/266303 [20:05:59] (03PS1) 10Giuseppe Lavagetto: lvs: convert ESAMS backup load balancers to etcd [puppet] - 10https://gerrit.wikimedia.org/r/266304 [20:06:01] (03PS1) 10Giuseppe Lavagetto: lvs: enable etcd for pybal on all ESAMS balancers [puppet] - 10https://gerrit.wikimedia.org/r/266305 [20:06:32] (03CR) 10Dzahn: [C: 032] "confirmed path" [puppet] - 10https://gerrit.wikimedia.org/r/266303 (owner: 10Subramanya Sastry) [20:06:57] (03CR) 10Dzahn: [V: 032] ruthenium: Fix bad paths in parsoid-rt-settings.js.erb [puppet] - 10https://gerrit.wikimedia.org/r/266303 (owner: 10Subramanya Sastry) [20:07:28] (03CR) 10GWicke: [C: 031] [staging]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266299 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [20:07:48] 6operations, 6Parsing-Team: Getting parsing-team memebrs sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1963172 (10ssastry) 3NEW [20:08:02] 6operations, 6Parsing-Team: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1963179 (10ssastry) [20:08:14] subbu: it started :) [20:08:29] ● parsoid.service - parsoid-vd: Testreduce HTTP service for visual-diff results Loaded: loaded (/lib/systemd/system/parsoid.service; static) Active: active (running) since Mon 2016-01-25 20:08:02 UTC; 3s ago [20:08:32] \o/ [20:08:49] what about parsoid-rt and parsoid-rt-clients? [20:09:24] Jan 25 20:07:38 ruthenium systemd[1]: parsoid-rt.service: main process exited, code...RE [20:09:27] Jan 25 20:07:38 ruthenium systemd[1]: Unit parsoid-rt.service entered failed state. [20:09:30] Jan 25 20:07:38 ruthenium nodejs[13953]: Unable to connect to database, error: Erro...06 [20:09:34] database connection [20:09:55] is mysql up? 
[20:09:59] mutante: doesn't make much sense for both of us to be debugging this -- if you're helping subbu, i'll switch to something else, if that's ok with you [20:10:00] RT test server and testreduce test server are running [20:10:10] (03PS4) 10EBernhardson: Create new puppet group analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) [20:12:16] (03PS1) 10BryanDavis: Only send warning and higher session logs to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266307 [20:12:41] (03CR) 10Ottomata: [C: 031] Create new puppet group analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [20:13:24] 6operations, 6Parsing-Team: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1963191 (10ssastry) Also need access to open mysql CLI and view / modify the databases used in testing. [20:15:22] !log bump labstore nfs threads to 288 from 244 [20:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:22] ori: ok. sorry, i just went on it to remove the rsyncd [20:15:22] subbu: there is no mysql installed [20:15:22] mutante: no apologies needed! :) [20:15:22] oh .. so, how do i get mysql there then? :) [20:15:24] i assumed it was "default install" .. but i guess there isn't such a thing. [20:15:28] (03PS1) 10Hoo man: Set $wgWikimediaBadgesCommonsCategoryProperty for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266308 (https://phabricator.wikimedia.org/T124403) [20:15:37] subbu: include one of the mysql classes in the role class on ruthenium, ideally [20:16:00] i'm not sure yet which one though [20:16:03] some manual setup is usually required; we don't make puppet do mysql grants or table creation [20:16:12] but the packages at least are provisioned by puppet [20:16:18] (03CR) 10Hoo man: [C: 032] "beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266308 (https://phabricator.wikimedia.org/T124403) (owner: 10Hoo man) [20:16:27] are we talking mariadb or actual mysql? [20:16:36] i don't think subbu minds [20:16:45] (03Merged) 10jenkins-bot: Set $wgWikimediaBadgesCommonsCategoryProperty for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266308 (https://phabricator.wikimedia.org/T124403) (owner: 10Hoo man) [20:16:50] how big is the dataset, and how important? should it be hosted on one of the production database clusters? [20:17:00] ya, i don't care as long as the services can connect to them. [20:17:01] no, doesn't have to be. [20:17:08] (cc jynus, in case I am misstating anything) [20:17:10] or we could request it from the DBA [20:17:25] and ask if it should run locally or use an external db* [20:17:37] wat? [20:17:52] everything had been running on ruthenium all this while, so i assumed after the upgrade, it will live there, but, i don't care. [20:18:08] mysql db for parsoid testing -- running on bare metal in prod, but not user-facing, and if data is lost it's not the end of the world [20:18:24] where would you prefer it to live (on ruthenium or some other host) and how would you like it configured? [20:18:26] testing is the key word here [20:19:08] is this like a copy of production or what? 
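(A sketch of what "include one of the mysql classes in the role class" could look like — the class name below is a generic placeholder, since the exact mysql/mariadb classes available in operations/puppet may differ, and, per the discussion, grants and table creation stay manual:)

```
# Hypothetical: the role only guarantees the server packages are present;
# accounts, grants and the testreduce schema (create_everything.mysql)
# are applied by hand afterwards.
class role::parsoid_testing_db {
    include ::mysql::server  # placeholder for whichever mysql/mariadb class applies

    # the testreduce/visualdiff settings files then point their db host
    # at localhost on this machine
}
```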
[20:19:12] 6operations, 6Parsing-Team, 10Parsoid, 6Services: parsoid roles: convert upstart to systemd - https://phabricator.wikimedia.org/T124480#1963207 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/266303/ the parsoid service runs now ``` root@ruthenium:/etc/systemd# service parsoid status ● parsoid.service -... [20:19:15] !log hoo@mira Synchronized wmf-config/Wikibase-labs.php: (no message) (duration: 01m 28s) [20:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:28] jynus, no. [20:19:51] it is a custom db with a fairly simple schema [20:20:22] probably the best thing is to add a ticket with all background- plus requirements in space and load (+predictions of growth) [20:20:25] it holds results from roundtrip tests ... about 300K rows created per test run (which can be a few times a week) .. so, over time, it builds up. [20:20:39] so, it is like a performance backlog? [20:20:56] it's for diffing the output of parsoid and mediawiki [20:21:04] a, ok [20:21:09] no, also for diffing between test runs of parsoid .. [20:21:11] so not mediawiki production [20:21:17] misc, or something else [20:21:18] right [20:21:19] we don't deploy parsoid before we verify tests there. [20:21:31] internal service, basically [20:21:36] visualeditor, content translation flow all depend on parsoid. [20:21:40] yep [20:21:47] and mobile content services. [20:21:51] yes, internal service. [20:22:17] please add that (which I already understood), and give an estimation of size and load, and its growth [20:22:34] and I will give you some options [20:22:48] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:22:50] cc DBA [20:22:57] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:22:57] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:07] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:17] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:17] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:17] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:18] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1963228 (10Dzahn) [20:23:24] 6operations, 6Parsing-Team, 10Parsoid, 6Services: parsoid roles: convert upstart to systemd - https://phabricator.wikimedia.org/T124480#1963225 (10Dzahn) 5Open>3Resolved a:3Dzahn parsoid-rt-client is running to: ``` parsoid-rt-client.service - Testing test client for Parsoid rt-testing Loaded:... [20:23:32] 6operations, 6Parsing-Team, 10Parsoid, 6Services: parsoid roles: convert upstart to systemd - https://phabricator.wikimedia.org/T124480#1963230 (10Dzahn) a:5Dzahn>3ssastry [20:24:20] alsafi is a ganeti VM for url-downloader in codfw [20:24:30] just recently added [20:25:01] we've seen these timeouts before for ganeti machines. hhhrmm [20:25:20] jynus, i am creating the ticket now, but mutante copied over the entire database that used to reside on ruthenium (prior to reimaging it from ubuntu 12.04 to jessie) if you want to take a look independently. [20:25:38] where?
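The retention idea floated just below ("purge old results periodically", later refined to deleting anything older than a month) would amount to a scheduled statement along these lines; the table and column names are illustrative, since the log only links to the real schema, and the database name is the one given later in the discussion:

```
# Hypothetical periodic purge for the round-trip results database;
# the real table/column names live in create_everything.mysql.
mysql testreduce_0715 -e \
  "DELETE FROM results WHERE created_at < NOW() - INTERVAL 3 MONTH;"
```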
[20:25:55] jynus: root@ruthenium:/mnt/data/mysql# [20:26:09] ah, within the server [20:26:21] i just copied the whole /mnt/data [20:26:31] it was like that before, yep [20:26:59] i remember checking the mysql version should stay the same on jessie [20:27:09] 6operations: labtestservices2001.wikimedia.org.crt - https://phabricator.wikimedia.org/T124374#1963258 (10Andrew) moritzm> the ca is the one here: root@palladium:/srv/private/modules/secret/secrets/ssl/wmf_ca_2014_2017 [20:27:39] 145 GB [20:27:43] greg-g: So I think the options for .11 are to (a) wait for .12 train to start tomorrow and ride it, (b) hold .12 train and run .11 through the whole train process again this week, or (c) try to run a very quick train progression with .11 today and tomorrow. [20:27:53] that is considerable already [20:28:00] jynus, https://github.com/wikimedia/mediawiki-services-parsoid-testreduce/blob/master/server/sql/create_everything.mysql is the schema [20:28:07] we can purge old results periodically. [20:28:33] and keep results from previous 3-6 months ... which will keep it contained. [20:28:40] greg-g: due to the wall clock time and anomie's work hours I'm not very comfortable going past group0 with .11 today [20:29:13] tgr|away, anomie: please add your opinions if they differ [20:29:27] can you add all of that to a ticket, I will have to check the best place and that will take me a while [20:29:29] ? [20:29:47] Personally, I like option (a). [20:29:53] +1 for a [20:30:14] (03PS1) 10Papaul: Decommission:Remove nitrogen Bug:T123732 [puppet] - 10https://gerrit.wikimedia.org/r/266310 (https://phabricator.wikimedia.org/T123732) [20:33:37] jouncebot: refresh [20:33:39] I refreshed my knowledge about deployments. [20:38:06] (03PS1) 10Papaul: Decommission: Removed nitrogen [puppet] - 10https://gerrit.wikimedia.org/r/266311 [20:41:01] 6operations: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963309 (10ssastry) 3NEW [20:41:30] (03PS1) 10Jcrespo: New database requested by MarkTraceur (project_ilustration) [puppet] - 10https://gerrit.wikimedia.org/r/266312 [20:42:21] jynus, I created T124703 for you. [20:43:22] 6operations: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963309 (10Dzahn) [20:43:26] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963342 (10ssastry) [20:43:37] subbu, thanks [20:43:40] * ori finds that sqlite is often a great solution for such use-cases [20:44:05] ori, we had sqlite originally [20:44:35] space is not such a big issue, load is more [20:44:36] that was what we built this on .. and then it started crawling pretty badly on the volume of data we had .. [20:44:44] ah [20:44:53] this is not performance critical since we only hit it a few times a week. [20:45:53] (03PS2) 10Jcrespo: New database requested by MarkTraceur (project_ilustration) [puppet] - 10https://gerrit.wikimedia.org/r/266312 (https://phabricator.wikimedia.org/T124705) [20:47:54] (03PS1) 10ArielGlenn: dumps: rebalance page ranges for dump jobs that run in parallel [puppet] - 10https://gerrit.wikimedia.org/r/266314 (https://phabricator.wikimedia.org/T123571) [20:49:13] (03PS1) 10Andrew Bogott: Added cert for ldap test server labtestservices2001. [puppet] - 10https://gerrit.wikimedia.org/r/266315 [20:49:30] (03PS2) 10Andrew Bogott: Added cert for ldap test server labtestservices2001.
[puppet] - 10https://gerrit.wikimedia.org/r/266315 (https://phabricator.wikimedia.org/T124374) [20:49:33] (03CR) 10Mobrovac: [C: 031] [staging]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266299 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [20:49:36] (03CR) 10Jcrespo: [C: 032] New database requested by MarkTraceur (project_ilustration) [puppet] - 10https://gerrit.wikimedia.org/r/266312 (https://phabricator.wikimedia.org/T124705) (owner: 10Jcrespo) [20:49:53] (03CR) 10jenkins-bot: [V: 04-1] dumps: rebalance page ranges for dump jobs that run in parallel [puppet] - 10https://gerrit.wikimedia.org/r/266314 (https://phabricator.wikimedia.org/T123571) (owner: 10ArielGlenn) [20:52:09] (03PS1) 10Dzahn: phabricator: don't use communitymetrics@, use wikitech [puppet] - 10https://gerrit.wikimedia.org/r/266316 (https://phabricator.wikimedia.org/T123581) [20:52:47] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963382 (10ssastry) [20:55:18] (03PS1) 10Ottomata: Puppetize offset.retention.minutes [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266318 [20:57:49] (03PS3) 10Andrew Bogott: Added cert for ldap test server labtestservices2001. [puppet] - 10https://gerrit.wikimedia.org/r/266315 (https://phabricator.wikimedia.org/T124374) [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160125T2100). Please do the needful. [21:00:05] marxarelli: twentyafterfour thcipriani ostriches ^^ see above re train ideas for this week [21:00:12] (from bd.808) [21:00:14] (03PS2) 10Ottomata: Puppetize offset.retention.minutes [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266318 [21:00:47] (03PS3) 10Ottomata: Puppetize offset.retention.minutes [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266318 [21:00:54] Did I miss some scrollback? Hmm [21:00:57] * ostriches looks harder [21:01:17] Ah found it. [21:02:19] fwiw, I think it would be good to get wmf.11 serving some live traffic today if possible. Might miss some compounding issues just moving to wmf.12. My $0.02. [21:02:21] (03PS4) 10Ottomata: Puppetize offset.retention.minutes [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266318 [21:02:55] +1 to thcipriani, at least something on wmf.11 for a bit, without incident, would make me really happy [21:03:03] 6operations, 7Mail, 7Mobile, 5Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#1963458 (10Dzahn) [21:03:26] +1. I'd say push .12 back to next week and roll .11 out this week. [21:03:30] (03PS5) 10Ottomata: Puppetize offset.retention.minutes [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266318 [21:03:44] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963461 (10jcrespo) 2 questions: * Need for backups? * Can I bring the service down for maintenace? [21:03:54] 6operations, 7Mail, 7Mobile, 5Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#1933178 (10Dzahn) @mobile do you guys need mobile@wikimedia.org , the email address? It used to be this: mobile: mobile-feedback-l@lists.wikimedia.org Is that sti... 
[21:03:59] Compressing the .11 cycle feels rushed, and skipping it for .12 just makes a bigger delta [21:04:15] * greg-g nods [21:04:20] bigger deltas are bad [21:04:22] (03CR) 10Ottomata: [C: 032] Puppetize offset.retention.minutes [puppet/kafka] - 10https://gerrit.wikimedia.org/r/266318 (owner: 10Ottomata) [21:04:46] ostriches: just to be clear, will wmf.12 not be cut this week then? (ie: are we just pushing the bigger delta one week later?) [21:04:55] I might take issue with pushing .12 back, but I would definitely like to roll .11 out somewhere just to verify fixes for the issues seen with .11 (and to verify new issues were not introduced) [21:04:57] would* [21:05:18] Bleh, I suppose that is true. [21:05:38] * MatmaRex braces for all the backports [21:05:59] (03PS1) 10Ottomata: Update kafka submodule with offsets.retention.minutes [puppet] - 10https://gerrit.wikimedia.org/r/266324 [21:06:08] (03PS3) 10Eevans: [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) [21:06:20] (03PS2) 10Ottomata: Update kafka submodule with offsets.retention.minutes [puppet] - 10https://gerrit.wikimedia.org/r/266324 [21:06:38] could push out .11 to group0(+1?) today, branch-cut tomorrow, push .11 forward ahead of the .12 train? [21:07:10] I mean, I guess depending on how wmf.11 behaves? fixes have been backported to that branch, right? [21:07:32] not since it was rolled back :P [21:07:39] it's going to take hours of just waiting for Jenkins [21:07:54] (03PS4) 10Eevans: [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) [21:07:56] (03CR) 10Andrew Bogott: [C: 032] Added cert for ldap test server labtestservices2001. [puppet] - 10https://gerrit.wikimedia.org/r/266315 (https://phabricator.wikimedia.org/T124374) (owner: 10Andrew Bogott) [21:08:06] (i am exaggerating but only slightly) [21:08:24] (03PS1) 10Andrew Bogott: Move some ldap host/cert name stuff into hiera. [puppet] - 10https://gerrit.wikimedia.org/r/266325 [21:08:28] We should start backports now ;-) [21:08:42] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963477 (10ssastry) * No backups required. * Yes. But, please check in with us before doing so just in case we are running tests or checking test results for deployment. [21:08:54] thcipriani: a la https://etherpad.wikimedia.org/p/wmf11and12 ? [21:09:00] (03PS3) 10Ottomata: Update kafka submodule with offsets.retention.minutes [puppet] - 10https://gerrit.wikimedia.org/r/266324 [21:09:09] (03CR) 10Ottomata: [C: 032 V: 032] Update kafka submodule with offsets.retention.minutes [puppet] - 10https://gerrit.wikimedia.org/r/266324 (owner: 10Ottomata) [21:09:30] (03PS5) 10Eevans: [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) [21:10:41] greg-g: that was my thinking.
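For reference, the backports being weighed here are cherry-picks of already-merged master changes onto the release branch, pushed back through gerrit so CI can gate them, which is where the hours of waiting for Jenkins come from. A sketch, with the commit reference as a placeholder:

```
# Recreate a master fix on the wmf.11 release branch and send it to
# gerrit for review/CI; jenkins-bot verifies before it can merge.
git fetch origin
git checkout -b backport-example origin/wmf/1.27.0-wmf.11
git cherry-pick <sha-of-master-commit>
git push origin HEAD:refs/for/wmf/1.27.0-wmf.11
```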
[21:10:49] (03PS1) 10Yuvipanda: toollabs: Include exec_environ for cron runner host [puppet] - 10https://gerrit.wikimedia.org/r/266329 [21:10:53] chasemp: ^ [21:10:54] (03PS6) 10Eevans: [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) [21:10:56] gonna merge now [21:11:08] (03PS2) 10Yuvipanda: toollabs: Include exec_environ for cron runner host [puppet] - 10https://gerrit.wikimedia.org/r/266329 [21:11:12] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#1963494 (10Dzahn) I received a Thinkpad T420, WMF 389, from OIT (thanks!) and i'll install jessie on it next. [21:11:18] (03CR) 10Rush: [C: 031] toollabs: Include exec_environ for cron runner host [puppet] - 10https://gerrit.wikimedia.org/r/266329 (owner: 10Yuvipanda) [21:11:20] ostriches: bd808 MatmaRex: thoughts? https://etherpad.wikimedia.org/p/wmf11and12 [21:11:23] Worksforme [21:11:39] (03PS2) 10Andrew Bogott: Don't send puppet nags to the novaadmin user. [puppet] - 10https://gerrit.wikimedia.org/r/266192 (https://phabricator.wikimedia.org/T124516) [21:11:41] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Include exec_environ for cron runner host [puppet] - 10https://gerrit.wikimedia.org/r/266329 (owner: 10Yuvipanda) [21:12:33] anomie, tgr|away: proposal from greg-g https://etherpad.wikimedia.org/p/wmf11and12 [21:12:52] greg-g: i'm just heckling :) [21:12:56] greg-g: one issue with today is that anomie will be offline very soon [21:13:08] MatmaRex: I appreciate everyone's input :) [21:13:15] * aude waves [21:13:26] bd808: yeah, that's the biggest drawback, afaict [21:14:44] Also, someone will need to backport a bunch of patches to wmf.11. That someone is not me. [21:15:39] anomie: is that list known? [21:15:48] greg-g: we can make the list [21:16:09] i.e. make the cherry-picks [21:16:17] there will be l10n lag too [21:16:32] * aude still digging out of the snow + hopes to travel tomorrow, so won't be around much to help with debugging anything [21:16:38] l10nupdate hasn't been updating .11 [21:17:04] but hope we get things moving and we do want to deploy new code for wikidata this week [21:17:44] (but ok if that doesn't work out) [21:17:58] PROBLEM - SSH on alsafi is CRITICAL: Server answer [21:18:07] does option 2 (same etherpad) make people feel more safe? does it change the doability of today's roll of wmf.11 to anywhere? [21:18:22] so lots of jenkins waiting and a full scap. That seems doable. No one to debug if something goes badly. [21:18:23] 6operations, 10RESTBase, 5Patch-For-Review: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1963534 (10Eevans) The staging changeset has been added to tomorrow mornings Puppet SWAT (https://wikitech.wikimedia.org/wiki/Deployments#Week_of_Januar... [21:18:32] * aude likes option 2 [21:18:45] option 2 won't allow for much exposure of wmf.11 before it hits enwiki [21:19:12] marxarelli: right, we would probably want to start the rollout earlier tomorrow and take a while to do it [21:19:16] (03PS2) 10ArielGlenn: dumps: rebalance page ranges for dump jobs that run in parallel [puppet] - 10https://gerrit.wikimedia.org/r/266314 (https://phabricator.wikimedia.org/T123571) [21:19:16] i hope to be around some tomorrow and hoo would be around to help monitor any issues on wikidata [21:19:30] I'll be around tomorrow evening [21:19:41] greg-g: option 2 looks nicer. 
that keeps .11 off of loginwiki for a bit longer [21:19:43] if the deploy happens as usual (time wise) [21:20:08] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [21:20:09] greg-g: it also gives time for l10nupdate to catch up [21:20:15] * greg-g nods [21:20:18] (03PS2) 10Dzahn: Decommission:Remove nitrogen from puppet [puppet] - 10https://gerrit.wikimedia.org/r/266310 (https://phabricator.wikimedia.org/T123732) (owner: 10Papaul) [21:20:30] ostriches: thcipriani marxarelli: votes from you on Option 2? :) [21:20:32] depending how things go, wmf11 for enwiki on wednesday [21:20:46] (03PS3) 10Dzahn: Decommission:Remove nitrogen from puppet [puppet] - 10https://gerrit.wikimedia.org/r/266310 (https://phabricator.wikimedia.org/T123732) (owner: 10Papaul) [21:20:51] if we still have problems, then maybe wmf11 for enwiki on thursday and wmf12 on monday? [21:20:52] So we're going with option #2? [21:20:55] also wfm. [21:20:57] (03PS1) 10Ori.livneh: Add a service alias for mw1017 (app server debug backend) [dns] - 10https://gerrit.wikimedia.org/r/266332 [21:21:08] hoo: waiting on the votes to come in, it's like election night [21:21:11] aude: Shall I branch Wikibase then, as planned [21:21:15] greg-g: wfm. [21:21:17] hoo: yeah [21:21:19] +1 on #2 [21:22:02] marxarelli: you're the last delegate to vote, Clinton or Bernie? [21:22:09] (03CR) 10Dzahn: [C: 032] Decommission:Remove nitrogen from puppet [puppet] - 10https://gerrit.wikimedia.org/r/266310 (https://phabricator.wikimedia.org/T123732) (owner: 10Papaul) [21:22:17] * marxarelli has only an illusion of choice [21:22:34] um, i would vote for just getting wmf.11 out this week, safely :) [21:22:47] marxarelli: agree [21:22:49] (03PS2) 10Dzahn: Decommission: Remove nitrogen from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/266311 (owner: 10Papaul) [21:22:57] (03CR) 10EBernhardson: "PS4 changes the role to role::elasticsearch::analytics at ottomata's request" [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [21:23:00] i mean, what's the rush with wmf.12? [21:23:00] which is why i say see how it goes [21:23:06] marxarelli: just to be clear, that means having 2 weeks of stuff to go out in wmf.12 starting next week [21:23:42] sure, but 2 weeks of stuff is not necessarily more delta than 1 week of substantial changes [21:23:49] (03PS1) 10Yuvipanda: toollabs: Symlink jlocal in /usr/bin [puppet] - 10https://gerrit.wikimedia.org/r/266334 [21:23:55] not every diff is the same, fair [21:24:04] and we've already seen quite a bit of change in wmf.11 [21:24:37] Dan is talking sensibly. [21:24:51] also, deploying wmf.11 and wmf.12 simultaneously means we have to track two deltas [21:25:06] right... [21:25:20] (03PS1) 10Hoo man: Disable commons category sidebar link overwrite in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266335 (https://phabricator.wikimedia.org/T124403) [21:25:24] ok, fuck it, let's just roll with wmf.11 pretending tomorrow is last tuesday [21:25:28] aude: ^ quick +1, please [21:25:32] (03PS2) 10Yuvipanda: toollabs: Symlink jlocal in /usr/bin [puppet] - 10https://gerrit.wikimedia.org/r/266334 [21:25:53] holy crap, do-overs _are_ real! [21:25:53] aude: how much of a pain is it for you all to wait? [21:25:56] :) [21:26:06] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963553 (10jcrespo) Second part is assumed always.
Let's settle it on m5-master, if it requires too much load, we will need to create m6. [21:26:26] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Symlink jlocal in /usr/bin [puppet] - 10https://gerrit.wikimedia.org/r/266334 (owner: 10Yuvipanda) [21:27:07] greg-g: that will work I guess. I'll need to prep the backports either way [21:27:30] I don't have anything against .12 moving forward for the record [21:27:43] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963555 (10Dzahn) used this as an example to go through decom process with @papaul he made https://gerrit.wikimedia.org/r/#/c/266310/ and https://gerrit.wikimedia.org/r/#/c/266311/ [21:28:29] I like having ano.mie around tomorrow to debug things again [21:28:53] the train will need to run at a reasonable time compared to EST [21:29:11] rather than just barely getting started by the time he goes offline [21:29:37] if we're just backporting and syncing wikiversions, i can get the train rolling earlier [21:29:40] marxarelli: can you do that (start the train with wmf.11 at a time brad can help debug from the east coast) [21:29:43] * greg-g nods [21:29:47] 6operations, 6Parsing-Team: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1963558 (10ssastry) >>! In T124701#1963191, @ssastry wrote: > Also need access to open mysql CLI and view / modify the databases used in testin... [21:29:48] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [21:30:02] marxarelli: true. it should cut out a lot of setup time [21:30:13] as early as 1700 utc [21:30:22] !log nitrogen - stop puppet, stop salt, remove from stored configs / icinga [21:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:30:42] papaul: ^ did that step of the nitrogen decom [21:30:56] (03PS3) 10Papaul: Decommission: Remove nitrogen from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/266311 [21:31:04] ok, any objections? (seriously, i'm obviously ok with logic changing my mind) [21:31:33] (03PS4) 10Papaul: Decommission: Remove nitrogen from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/266311 [21:31:33] marxarelli: To clarify, "east coast" means I start my workday at 14:00 UTC and end at 22:00 UTC. 17:00 UTC is noon for me, leaving 5 hours of me being around. [21:32:07] I can only be around 18:30 UTC onwards, but I don't expect Wikidata difficulties if we go with #2 [21:32:26] (03CR) 10Dzahn: [C: 032] Decommission: Remove nitrogen from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/266311 (owner: 10Papaul) [21:32:50] hoo: just to be clear, the plan on the table is now: "Pretend it is last week and roll wmf.11 like it should have been" [21:33:07] greg-g: Uhm... so no wmf12? [21:33:26] next week [21:33:39] is there a list of backports that need picking? if we can get that done by today, i can sync versions right after my morning cup of coffee (~ 1615 UTC) [21:33:42] hm... we could still create a new wikidata wmf11 branch [21:33:49] by *eod* today [21:33:55] there isn't a list [21:34:17] but i imagine you'd start by looking at everything anomie committed in the last days [21:34:26] marxarelli: I will have the backports in gerrit by the end of my day today [21:35:00] aude: Lydia_WMDE: What do you think? New wmf11 Wikidata branch?
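The "backporting and syncing wikiversions" shortcut mentioned above is, roughly, the following on the deployment host; this assumes the 2016-era tooling, and the wiki group and version shown are examples:

```
# On the deployment host (tin/mira): map group0 wikis to the new
# branch in wikiversions.json, then compile and push it everywhere.
cd /srv/mediawiki-staging
$EDITOR wikiversions.json        # group0 -> php-1.27.0-wmf.11
sync-wikiversions 'group0 wikis to 1.27.0-wmf.11'
```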
[21:35:07] bd808: radical [21:36:17] PROBLEM - salt-minion processes on nitrogen is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:37:07] ok, thus it is so. I'll reply to bd's email with the status [21:37:53] (03PS2) 10Yuvipanda: Install libbytes-random-secure-perl on tool labs [puppet] - 10https://gerrit.wikimedia.org/r/264440 (https://phabricator.wikimedia.org/T123824) (owner: 10Anomie) [21:37:55] !log mobileapps deployed 9252a22 [21:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:38:01] (03CR) 10Yuvipanda: [C: 032 V: 032] Install libbytes-random-secure-perl on tool labs [puppet] - 10https://gerrit.wikimedia.org/r/264440 (https://phabricator.wikimedia.org/T123824) (owner: 10Anomie) [21:38:44] (03PS3) 10Yuvipanda: apt: Remove extra space in sources.list [puppet] - 10https://gerrit.wikimedia.org/r/263380 (owner: 10Tim Landscheidt) [21:38:51] (03CR) 10Yuvipanda: [C: 032 V: 032] apt: Remove extra space in sources.list [puppet] - 10https://gerrit.wikimedia.org/r/263380 (owner: 10Tim Landscheidt) [21:38:52] ACKNOWLEDGEMENT - salt-minion processes on nitrogen is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn decom [21:41:27] hoo: don't count on my availability this week [21:42:02] I can be around myself [21:42:03] * aude travelling tomorrow and probably needs to take some holidays [21:42:11] well, unless people decide to move deploys around like crazy [21:42:21] back on thursday in the office, but who knows and will be tired [21:42:41] jonas is travelling all the way across the ocean to the UK :/ [21:42:48] jonas the blizzard :D [21:42:59] RECOVERY - Disk space on alsafi is OK: DISK OK [21:43:00] Oh, that Jonas :D [21:43:03] yeah [21:43:08] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [21:43:08] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:43:17] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [21:43:18] so potentially a problem again [21:43:23] ^ again, i did that by _just logging on _ [21:43:27] If the deploys happen at the regular times, that is perfectly fine for me, I arranged my week for that to work [21:43:32] hoo: ok with me [21:43:37] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [21:43:37] RECOVERY - DPKG on alsafi is OK: All packages OK [21:43:37] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [21:43:57] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963601 (10jcrespo) Does mariadb 10 work for you? [21:45:23] !log alsafi - was reported down in icinga , is ganeti VM - fixed by just logging in as if it went to hibernate [21:45:24] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963605 (10ssastry) I assume the mysql schema is compliant with it? If so, yes. [21:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:45:27] greg-g: So deployment times stay as announced on Wikitech? [21:45:49] If not, we might want to scap a Wikidata update to group0 ourselves tomorrow [21:46:07] marxarelli: what'd you decide on timing tomorrow? 
[21:46:10] (03PS2) 10Andrew Bogott: Move some ldap host/cert name stuff into hiera. [puppet] - 10https://gerrit.wikimedia.org/r/266325 [21:47:28] greg-g: haven't made a firm decision yet. doing it earlier would mean more anomie availability for debugging, but maybe we can find a sweet spot? [21:47:36] !log nitrogen - shutdown -h now .... [21:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:48:14] greg-g: While you're around... I would like to push https://gerrit.wikimedia.org/r/266335 very quick. No-op for now (just wanting to make sure we don't accidentally deploy) [21:48:29] 17:00 rollout time (prep before) gives us 5 of his hours, *should* be good, no? [21:48:57] hoo: yeah [21:49:01] doit [21:49:03] that sounds sufficient to me [21:49:15] marxarelli: kk, I'll edit wiki page again [21:50:45] (03PS2) 10Hoo man: Disable commons category sidebar link overwrite in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266335 (https://phabricator.wikimedia.org/T124403) [21:51:10] (03CR) 10Hoo man: [C: 032] Disable commons category sidebar link overwrite in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266335 (https://phabricator.wikimedia.org/T124403) (owner: 10Hoo man) [21:51:36] (03Merged) 10jenkins-bot: Disable commons category sidebar link overwrite in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266335 (https://phabricator.wikimedia.org/T124403) (owner: 10Hoo man) [21:52:53] robh (or whoever else can help): deployment-pdf01 seems to have a hung puppet and/or salt client. [21:52:59] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963619 (10Dzahn) server has been shutdown, removed from puppet, DHCP revoked puppet cert and salt-key, disable notifications and removed from Icinga/stored configs [21:53:02] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963620 (10jcrespo) I need to start mysql on ruthenium and export it for the migration. The application then has to puppetize its config to point the database host to 'm5-m... [21:53:21] cscott: thats a vm i take it? [21:53:39] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:53:41] !log hoo@mira Synchronized wmf-config/Wikibase-production.php: Disable (not yet deployed) commons category sidebar link overwrite in production (duration: 01m 28s) [21:53:42] robh: i don't know, actually. i've never seen any physical hardware for any of the machines i log on to. ;) [21:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:54:08] for that matter, there are a bunch of people on this channel who might be virtual as well, for all I know. ;) [21:54:28] cscott: uh, where exactly is this hung? [21:54:32] cuz i dont see that host as a host. [21:54:58] deployment-pdf01.deployment-prep.eqiad.wmflabs [21:55:06] oh, its wmflabs... [21:55:08] cscott: sounds like Beta Cluster host. you might want to bring that over to #-releng [21:55:15] thcipriani is helping me over in -releng, thanks. [21:55:24] cscott: i have zero idea how to do things in labs, thats not really production [21:55:28] it was pretty quiet in there for 10 minutes or so, so I turned here for help. 
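The migration jcrespo sketches in the ticket above (start mysqld on ruthenium against the copied datadir, export, import into the shared misc cluster) would look approximately like this; the m5-master hostname follows the usual WMF naming and the flags are illustrative:

```
# On ruthenium: take a logical dump from the locally started mysqld,
# run against the cloned copy so the original files stay untouched.
mysqldump --single-transaction testreduce_0715 > testreduce_0715.sql

# Import into the misc cluster master, then puppetize the app config
# to point at m5-master instead of localhost.
mysql -h m5-master.eqiad.wmnet testreduce_0715 < testreduce_0715.sql
```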
[21:55:56] robh: well, assuming i get OCG deployed in labs, my next step will be to deploy to production, so hopefully i'm not right back here asking for the same git-deploy help on a production box. [21:56:15] cscott: you dont have root on the labs box you are working on? [21:56:52] robh: i have root. i just don't have the knowledge-of-salt necessary to know what to do with root. [21:57:21] aude: greg-g: I talked to Lydia and we decided to not deploy new Wikidata code this week [21:57:28] So don't mind us at all [21:57:34] oh, its not a 'do this for me please' its a 'why the hell is this happening?' that is quite different heh [21:57:40] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1963647 (10Dzahn) [21:57:45] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:57:55] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:58:10] thcipriani seems to think it's a server name mismatch caused by renaming the puppetmaster in labs. [21:58:14] hoo: ok [21:58:35] you should direct your question to the labs team i think, i have no idea how to handle their puppetmaster migrations, sorry [21:58:44] andrewbogott or YuviPanda or chasemp =] [21:58:54] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963654 (10Dzahn) @papaul is going to follow-up with a DNS change and subtask for onsite-tech [21:58:56] (ping! ;) [21:59:02] ? [21:59:09] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963655 (10Dzahn) a:5Dzahn>3Papaul [21:59:12] who should? [21:59:13] cscott is having issues with his labs host and puppetmasters [21:59:16] oh [21:59:24] no that's deployment-prep which is solely releng's realm :) [21:59:38] I did ask him to ask them [21:59:43] oh, ok [21:59:45] but I guess they're all busy with the mw train [22:00:10] yeah, he had to wait 10 minutes for a response to a beta cluster question :) [22:00:15] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [22:00:15] 6operations, 10Parsoid: Need databases provisioned for parsoid-rt testing, visual diff testing - https://phabricator.wikimedia.org/T124703#1963657 (10jcrespo) I am cloning the data before exporting it to start mysql on a file copy to not modify the original. That will take some time. [22:00:36] (palladium is me, fixed) [22:01:13] cscott: if its an issue where we have beta cluster downtime and I need to start waking up opsen due to it being an emergency i can help. if its troubleshooting the puppet issue in beta labs on a technical level (but non emergency) then I'm not the best person to help you with this. [22:01:35] I realize that its a shitty answer though! [22:01:45] 6operations, 10netops: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (10Papaul) 3NEW a:3Dzahn [22:01:55] (and it sounds like im passing the buck since i effectively am, but its not due to lack of care but lack of knowledge) [22:02:14] it's alright, tyler is on it [22:02:15] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
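If thcipriani's diagnosis above is right, the agent on deployment-pdf01 is still holding certificates issued for the old puppetmaster name. The usual remedy is to wipe the agent's SSL state and re-run against the renamed master; a sketch, with the master FQDN as a placeholder:

```
# On the affected instance: discard certs tied to the old master name,
# then fetch a fresh cert and catalog from the new one.
sudo rm -rf /var/lib/puppet/ssl
sudo puppet agent --test --server <new-puppetmaster-fqdn>
```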
[22:02:18] 6operations, 10netops: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (10Papaul) [22:02:23] over in -releng :) [22:02:25] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [22:02:26] ok, cool [22:02:33] ok! [22:02:38] 6operations, 10netops: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963677 (10Dzahn) [22:02:39] yeah, no worries. i was really asking here just because the knowledge-of-salt was here so i might get some hints as to what to try. but, like i said, tyler showed up, and seems to have the knowledge bits i lack. [22:03:55] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: puppet fail [22:04:05] 6operations, 10ops-eqiad: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (10Dzahn) [22:04:17] (03CR) 10Jhobs: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [22:05:45] (03PS3) 10Andrew Bogott: Move some ldap host/cert name stuff into hiera. [puppet] - 10https://gerrit.wikimedia.org/r/266325 [22:09:01] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1963710 (10JanZerebecki) https://grafana-admin.wikimedia.org/dashboard/db/tmp-t124418 [22:10:54] jynus, if it makes it simpler, you can delete all data that is older than 1 month from the testreduce_0715 database. [22:11:26] (03PS1) 10Legoktm: Add debug log for T124356 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266410 (https://phabricator.wikimedia.org/T124356) [22:12:35] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:14:39] (03PS2) 10BBlack: lvs: convert ESAMS backup load balancers to etcd [puppet] - 10https://gerrit.wikimedia.org/r/266304 (owner: 10Giuseppe Lavagetto) [22:18:19] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1963752 (10ori) >>! In T124418#1963710, @JanZerebecki wrote: > https://grafana-admin.wikimedia.org/dashboard... [22:18:31] (03PS4) 10Andrew Bogott: Move some ldap host/cert name stuff into hiera. [puppet] - 10https://gerrit.wikimedia.org/r/266325 [22:21:07] subbu, sadly, as there are not logical backups, I have to set it up, then export and import to the new place [22:21:14] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:21:39] jynus, ah, right, makes sense. [22:22:26] I logged into tin by habit, and woah, that's some scary ascii art [22:22:55] (03PS2) 10Ori.livneh: Add a service alias for mw1017 (app server debug backend) [dns] - 10https://gerrit.wikimedia.org/r/266332 [22:23:00] (03PS1) 10Ori.livneh: Undo special-casing of testwiki in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/266414 [22:23:22] (03PS5) 10Andrew Bogott: Move some ldap host/cert name stuff into hiera. [puppet] - 10https://gerrit.wikimedia.org/r/266325 [22:24:10] legoktm@mira:/srv/mediawiki-staging/php-1.27.0-wmf.10$ git rebase origin/wmf/1.27.0-wmf.10 [22:24:10] Cannot rebase: You have unstaged changes. [22:24:10] Please commit or stash them. [22:25:17] (03CR) 10Andrew Bogott: [C: 032] Move some ldap host/cert name stuff into hiera. 
[puppet] - 10https://gerrit.wikimedia.org/r/266325 (owner: 10Andrew Bogott) [22:25:35] oh, ugh [22:26:02] (03PS3) 10BBlack: lvs: convert ESAMS backup load balancers to etcd [puppet] - 10https://gerrit.wikimedia.org/r/266304 (owner: 10Giuseppe Lavagetto) [22:26:10] (03CR) 10BBlack: [C: 032 V: 032] lvs: convert ESAMS backup load balancers to etcd [puppet] - 10https://gerrit.wikimedia.org/r/266304 (owner: 10Giuseppe Lavagetto) [22:26:51] 10Ops-Access-Requests, 6operations: Grant James F. (`jforrester`) access to Hive (`analytics-privatedata-users`) - https://phabricator.wikimedia.org/T124719#1963797 (10Jdforrester-WMF) 3NEW [22:27:15] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:28:21] !log legoktm@mira Synchronized php-1.27.0-wmf.10/includes/content/WikitextContent.php: https://gerrit.wikimedia.org/r/#/c/266401/ (duration: 01m 29s) [22:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:28:34] 10Ops-Access-Requests, 6operations: Grant James F. (`jforrester`) access to Hive (`analytics-privatedata-users`) - https://phabricator.wikimedia.org/T124719#1963815 (10TrevorParscal) As James's manager I approve. [22:30:25] !log legoktm@mira Synchronized php-1.27.0-wmf.10/includes/parser/: https://gerrit.wikimedia.org/r/#/c/266401/ + https://gerrit.wikimedia.org/r/#/c/266406/ + live hacks (duration: 01m 28s) [22:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:30:35] (03CR) 10Legoktm: [C: 032] Add debug log for T124356 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266410 (https://phabricator.wikimedia.org/T124356) (owner: 10Legoktm) [22:30:58] (03Merged) 10jenkins-bot: Add debug log for T124356 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266410 (https://phabricator.wikimedia.org/T124356) (owner: 10Legoktm) [22:31:06] mira, eh? [22:31:26] hmm, I just lost my shell to mira [22:32:03] uhh, and now I can't ssh in to mira [22:32:18] wfm [22:37:20] having some problems with trebuchet in beta. The problems with trebuchet deployment this morning: were they caused by the move to mira or something else? Like the redis returner for instance... [22:37:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [1000.0] [22:37:40] (the problems with trebuchet deployment in production this morning, I mean) [22:38:05] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:41:15] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [22:42:36] ori, et al. Lego says his home Internet is having issues. [22:42:38] what's up? [22:42:44] that 5xx increase is real [22:42:49] And he'll be back in about ~15 minutes. [22:42:57] does he need anything? [22:43:15] * ori looks at the error logs [22:44:10] He said "~15 minutes" at :39 after, so he'll be back in about 10 minutes, I imagine. I asked if he needs anything. Waiting to hear back. 
[22:44:52] so far it looks like the 5xx's took off around :28 and ended around :40 [22:45:01] but sometimes the data lags a bit, so not sure about the end yet [22:45:14] it lines up with 22:28 < logmsgbot> !log legoktm@mira Synchronized php-1.27.0-wmf.10/includes/content/WikitextContent.php: https://gerrit.wikimedia.org/r/#/c/266401/ (duration: 01m 29s) [22:45:25] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [22:46:25] (we're still talking about a way-sub-1% 5xx rate, but it's a notable bump, from a usual rate of ~0.7/s to ~30/s) [22:46:36] message repeated 8 times: [ #012Fatal error: [] operator not supported for strings in /srv/mediawiki/php-1.27.0-wmf.10/includes/parser/ParserOutput.php on line 237] [22:46:42] legoktm: ^ [22:47:21] ori: ^ [22:47:38] i'll fix [22:48:37] thanks [22:48:41] thanks! [22:48:57] (and yeah the error graphs are lagging, it's still ongoing) [22:51:15] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [22:51:55] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [22:52:54] !log ori@mira Synchronized php-1.27.0-wmf.10/includes/parser/ParserOutput.php: Fix-up for ParserOutput.php@263 debug logging (duration: 01m 27s) [22:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:53:10] (03PS1) 10Dereckson: Set category collation to uca-lt on lt.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266427 (https://phabricator.wikimedia.org/T123627) [22:55:49] ah crap sorry [22:55:50] I live-hacked the fix since I don't know what the relevant changes were trying to do; please hold off on syncing anything until legoktm comes back and takes a look. [22:55:53] my home internet totally died on me [22:56:16] no worries, thanks for getting a message through [22:56:25] (and thanks, Leah) [22:56:57] (03PS1) 10RobH: add jforrester to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/266428 [22:57:27] legoktm: I live-hacked these lines: https://dpaste.de/FV3H/raw [22:57:40] 10Ops-Access-Requests, 6operations: Grant James F. (`jforrester`) access to Hive (`analytics-privatedata-users`) - https://phabricator.wikimedia.org/T124719#1963955 (10RobH) 5Open>3stalled As James already has a shell user (and has signed L3), with Trevor's above approval this request is slated for deploym... [22:58:01] 10Ops-Access-Requests, 6operations: Grant James F. (`jforrester`) access to Hive (`analytics-privatedata-users`) - https://phabricator.wikimedia.org/T124719#1963957 (10Jdforrester-WMF) Thank you.
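The fatal quoted above is PHP's array-append operator hitting a variable that already holds a string, which is exactly what stale cached values would trigger once new code assumed an array. It reproduces in isolation with a one-liner (illustrative, not the actual ParserOutput code):

```
# Emits: Fatal error: [] operator not supported for strings
php -r '$x = "cached string"; $x[] = "new entry";'
```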
[22:58:04] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:58:14] oh bah [22:58:18] I forgot about cached entries [22:58:21] thanks ori [22:58:29] np [22:59:59] (03PS2) 10BBlack: lvs: enable etcd for pybal on all ESAMS balancers [puppet] - 10https://gerrit.wikimedia.org/r/266305 (owner: 10Giuseppe Lavagetto) [23:00:45] (03CR) 10BBlack: [C: 032 V: 032] lvs: enable etcd for pybal on all ESAMS balancers [puppet] - 10https://gerrit.wikimedia.org/r/266305 (owner: 10Giuseppe Lavagetto) [23:01:17] (03PS1) 10Rush: nfsd: bump threads up to around active client count [puppet] - 10https://gerrit.wikimedia.org/r/266429 [23:02:25] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:02:30] (03PS2) 10Rush: nfsd: bump threads up to around active client count [puppet] - 10https://gerrit.wikimedia.org/r/266429 [23:04:13] (03CR) 10Andrew Bogott: [C: 031] nfsd: bump threads up to around active client count [puppet] - 10https://gerrit.wikimedia.org/r/266429 (owner: 10Rush) [23:04:27] (03PS2) 10Ori.livneh: Undo special-casing of testwiki in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/266414 [23:05:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:06:18] (03CR) 10Yuvipanda: nfsd: bump threads up to around active client count (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266429 (owner: 10Rush) [23:06:55] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:07:42] !log legoktm@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266410/ (duration: 01m 35s) [23:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:56] (03PS3) 10Rush: nfsd: bump threads up to around active client count [puppet] - 10https://gerrit.wikimedia.org/r/266429 [23:08:24] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:09:07] (03PS1) 10Dereckson: Get rid of $wg = $wmg for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) [23:13:51] (03PS4) 10Rush: nfsd: bump threads up to around active client count [puppet] - 10https://gerrit.wikimedia.org/r/266429 [23:14:42] !log legoktm@mira Synchronized php-1.27.0-wmf.10/includes/parser/: live hacks, now committed (duration: 01m 27s) [23:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:01] hey ori, if I need to load stuff from disk while in CommonSettings, is a php file the best way to do it? [23:15:08] (03CR) 10Andrew Bogott: [C: 031] nfsd: bump threads up to around active client count [puppet] - 10https://gerrit.wikimedia.org/r/266429 (owner: 10Rush) [23:15:24] everything is deployed now and fatalmonitor looks decent... [23:15:39] I seem to recall you switched the wikiversions stuff around? [23:17:01] yes, and yes [23:17:11] (03CR) 10Rush: [C: 032] nfsd: bump threads up to around active client count [puppet] - 10https://gerrit.wikimedia.org/r/266429 (owner: 10Rush) [23:17:35] depends what you mean by loading stuff from disk; if it's the sort of thing that could be represented as an associative array, then yes. but don't base64 encode images or anything. [23:17:40] (re Krenair ^) [23:18:36] ori, was much simpler than that actually. 
basically we want to pick up a couple of LDAP passwords stored on disk by puppet [23:19:13] currently it's a php include, but I think it needs to be moved outside of the deployment directories [23:20:02] was thinking of changing it around to work differently [23:23:14] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [23:25:59] Krenair: Sounds OK. What did you have in mind? [23:30:51] 6operations, 10CirrusSearch, 6Discovery, 7Elasticsearch: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#1956116 (10Deskana) [23:44:38] 6operations, 10CirrusSearch, 6Discovery, 7Elasticsearch: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#1964147 (10EBernhardson) an nginx proxy seems very simple to do, and I can put that together if desired. I realizes it's a ton more work, hardware, and I honestly... [23:48:35] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:51:21] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#1964177 (10Deskana) p:5Triage>3Low