[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170106T0000). Please do the needful. [00:00:14] (03PS1) 10Filippo Giunchedi: puppetmaster: fail on private post-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/330824 [00:02:00] PROBLEM - puppet last run on restbase-test1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/cassandra-instances.d] [00:02:49] right, so those are old hostnames which shouldn't be in icinga anyways, they don't even resolve in dns [00:03:30] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [00:05:00] RECOVERY - puppet last run on restbase-test1003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [00:11:43] !log carbon - stopping ganglia-monitor-aggregator for good [00:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:50] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:12:50] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:13:00] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [00:13:28] !log analytics1036, ms-fe1003 - ran puppet to fix Icinga [00:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:00] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [00:15:40] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [00:15:40] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [00:23:49] (03PS7) 10Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) [00:25:29] (03PS8) 10Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) [00:34:11] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [00:37:12] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[00:43:20] (03PS2) 10Filippo Giunchedi: Pass the filtered request headers to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330646 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles) [00:43:36] (03PS2) 10Filippo Giunchedi: Add PoolCounter configuration to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330647 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles) [00:44:10] (03PS1) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [00:46:36] @seen jouncebot [00:46:36] mutante: jouncebot is in here, right now [00:46:48] jouncebot: next [00:46:48] In 0 hour(s) and 13 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170106T0100) [00:47:11] (03CR) 10Filippo Giunchedi: [C: 032] Add PoolCounter configuration to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330647 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles) [00:47:20] out for 13 minutes :) [00:47:42] (03CR) 10Filippo Giunchedi: [C: 032] Pass the filtered request headers to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330646 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles) [00:59:34] Is Gerrit really down right now? [00:59:51] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused [00:59:51] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:00:01] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_geowiki-scripts],Exec[git_pull_statistics_mediawiki] [01:00:04] ostriches and mutante: Respected human, time to deploy Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170106T0100). Please do the needful. [01:00:21] divadsn: yes scheduled maintenance [01:00:31] though a !log would have been nice [01:00:40] !log [01:00:44] Eh, forgot to !log sorry [01:00:51] !log gerrit: down for upgrade [01:00:51] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.4-13-gc0c5cc4742 (SSHD-CORE-1.2.0) (protocol 2.0) [01:00:51] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:06] !log gerrit: back up from upgrade [01:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:24] nice, thanks ostriches [01:01:31] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [01:01:52] now that gerrit is back, using gerrit to merge gerrit config change :) [01:02:23] Yep, gonna do our swap to logstash during the window :) [01:02:25] (03PS26) 10Dzahn: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [01:02:27] godog, ah okay, thanks for info :) [01:03:21] (03CR) 10Dzahn: [C: 032] Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [01:04:28] ostriches: any other small change that was ok and waiting? [01:05:01] Keep getting ':502 Bad Gateway' from RCStream? 
[01:05:04] _not_ the one changing database collation in connector [01:05:06] Did something happen recently? [01:05:06] heh [01:05:10] http://codepen.io/Krinkle/pen/laucI/?editors=0010 [01:05:14] Fails 9/10 times when I refresh [01:05:26] 502 from ngin [01:05:27] x [01:05:38] https://stream.wikimedia.org/socket.io/1/?t=1483664718061 [01:05:55] mutante: That was the only one I had planned for this window [01:06:03] it's supposed to return something like "450610263156:60:60:websocket,xhr-multipart,htmlfile,jsonp-polling,flashsocket,xhr-polling" but most of the time doesn't [01:06:06] ostriches: alright... .and applied. [01:06:31] Ok, restarting service to pick it up [01:06:47] !log stream.wikimedia.org problems - nginx responds with HTTP 502 Bad Gateway to most requests [01:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:26] Hmm, service awfully slow.... [01:08:42] stat1003 tried to git clone something from gerrit right during the time [01:10:01] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:10:08] mutante: Constant timeouts from gerrit, let's revert logstash for now. 
[01:10:21] ugh, ok [01:11:10] well, i think we need to live hack [01:11:16] Already on it [01:11:16] so we can use gerrit to revert [01:11:21] ok [01:11:27] Disabled puppet, reverted, restarting [01:11:52] Yep, and service is back + snappy again [01:11:56] So yeah, that was causing it [01:12:15] (03PS1) 10Dzahn: Revert "Gerrit: Enable logstash in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330830 [01:12:19] (03PS1) 10Chad: Revert "Gerrit: Enable logstash in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330831 [01:12:24] Hah [01:12:30] (03Abandoned) 10Chad: Revert "Gerrit: Enable logstash in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330831 (owner: 10Chad) [01:12:45] (03CR) 10Dzahn: [C: 032] "unfortunately this caused gerrit to timeout, it came back when reverting this change" [puppet] - 10https://gerrit.wikimedia.org/r/330830 (owner: 10Dzahn) [01:12:47] (03CR) 10Chad: [C: 031] Revert "Gerrit: Enable logstash in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330830 (owner: 10Dzahn) [01:13:39] (03PS1) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [01:13:46] hehee [01:13:48] merged on master, you can enable again [01:14:09] paladox: sup [01:14:09] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [01:14:15] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [01:14:21] that stuff is because puppet wants to git clone [01:14:29] the other thing is a netsplit [01:14:40] mutante hi, i think ostriches hit the timeout because it probably could not hit logstash in prod [01:14:50] logstash1002.eqiad.wmnet [01:14:57] paladox: yea, first guess will be missing ferm rule? [01:15:05] yep [01:15:20] Ok, puppet back on cobalt [01:15:23] we can always add support for logstash but not enable it by default [01:16:03] let's check the iptables rules on the destination [01:16:04] (03PS2) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [01:16:21] make another change to fix that there if it is indeed the missing hole [01:16:29] mutante that ^^ should at least add support but have it disabled now :) [01:16:34] and ok [01:16:35] Ok, well bummer that didn't work [01:16:46] support but not enable sounds ok'ish to me [01:16:56] yep [01:18:08] paladox: ostriches: i guess puppet/modules/role/manifests/logstash/collector.pp [01:18:18] with all the existing ferm::service in there [01:18:28] collector.pp: ferm::service { 'logstash_log4j': [01:18:50] but we looked at that and saw srange is $DOMAIN_NETWORKS [01:18:58] yep [01:19:05] so... that should have worked [01:19:35] https://github.com/wikimedia/operations-puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/base/templates/firewall/defs.erb#L2 [01:19:39] mutante ^^ [01:19:50] you should be able to test with telnet right if it's a tcp connection? [01:19:52] did the log_port get set correct? [01:20:04] i see there's a variable there in that template [01:20:10] Terrible time for a netsplit [01:20:14] lol, yea [01:20:42] mutante i think so [01:20:44] logstash1002.eqiad.wmnet [01:20:56] mutante try telnet for logstash1002.eqiad.wmnet please?
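[Editor's note] The collector.pp rule being discussed would look roughly like this. This is a sketch from the fragments quoted in the log (the `'logstash_log4j'` name, the srange, and the port probed just below), not a verbatim copy of modules/role/manifests/logstash/collector.pp:

```puppet
# Sketch of the firewall hole under discussion (assumed shape, not the
# actual file). $DOMAIN_NETWORKS is a ferm macro covering all production
# networks, which is why cobalt should already have been allowed through
# and why the "missing ferm rule" theory was ruled out.
ferm::service { 'logstash_log4j':
    proto  => 'tcp',
    port   => '4560',
    srange => '$DOMAIN_NETWORKS',
}
```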
[01:21:15] i am, port 4560 [01:21:21] thanks [01:21:23] i connect to something, which then disconnects me again [01:21:23] yep [01:21:28] so not firewall [01:21:33] oh [01:22:09] Freaking netsplit, terrible timing [01:23:28] mutante: logstash on logstash1002 looks busted :/ [01:23:29] Ok, caught up IRC client? [01:23:31] Yes? [01:23:32] Good. [01:23:42] ostriches: ohi [01:23:46] mutante: I sent you an e-mail [01:23:57] I'm closing the window, I don't wanna play monkey patch after 5pm [01:24:01] We got lucky the actual upgrade was fast [01:24:10] Let's not jinx it with the logstash thing [01:24:16] That was never *urgent* [01:24:31] is the syntax right in this erb ? [01:24:32] https://gerrit.wikimedia.org/r/#/c/326177/26/modules/gerrit/templates/log4j.properties.erb [01:24:51] ostriches: fair, yes :) [01:24:54] mutante: Yes, it was correct, log4j file was fine [01:25:00] mutante yep, tested locally [01:25:10] without setting the log_host it will not set any tcp [01:25:18] with log_host set it will set the tcp [01:25:28] tested with the local puppet master i have [01:25:58] alright, was wondering for a second if i should have seen the literal values in the diff: [01:26:01] -log4j.appender.tcp.Port=<%= @log_port %> [01:26:04] -log4j.appender.tcp.RemoteHost=<%= @log_host %> [01:26:11] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [01:26:20] mutante: Um....no? 
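[Editor's note] The manual telnet check above can be scripted with bash's built-in `/dev/tcp` device, so no telnet client is needed. Host and port are the ones from the conversation; note this only distinguishes open from closed/filtered, it cannot see the accept-then-disconnect behaviour observed here (which points at an unhealthy daemon rather than a firewall):

```shell
#!/bin/bash
# Probe TCP reachability the way the telnet test above does:
# a firewall drop shows up as a timeout, a dead daemon as a refusal.
probe() {
    local host=$1 port=$2
    if timeout 3 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open: ${host}:${port}"
    else
        echo "closed or filtered: ${host}:${port}"
    fi
}

probe 127.0.0.1 4560   # in prod: probe logstash1002.eqiad.wmnet 4560
```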
[01:26:21] I dunno [01:26:27] It looked fine on disk earlier :D [01:26:37] eg: log4j.appender.tcp.RemoteHost=logstash1002.eqiad.wmnet [01:26:47] ok, good [01:27:00] well, then, not jinxing it [01:27:05] but weird [01:27:14] !log Restarted logstash on logstash1002 (T154732) [01:27:15] tests were done with a puppetmaster [01:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:19] T154732: Exception in thread "Ruby-0-Thread-18: /opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.20/lib/stud/buffer.rb:92" java.lang.UnsupportedOperationException - https://phabricator.wikimedia.org/T154732 [01:27:33] that ^ may have had something to do with it [01:27:41] aah [01:27:59] looked like logstash soft died a day ago on that node :/ [01:28:00] because this is a new plugin, right [01:28:01] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [01:28:08] oh, a day ago, interesting [01:28:21] we've had the java input thing for a while [01:28:55] I need to look back at sal but I think that's the same error that took logstash1001 down over the holidays [01:29:08] bd808: First time using it, but we enabled the actual plugin a month or so ago ya [01:29:31] well, i'd be up for one more try, if you want to [01:29:31] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [01:29:44] yeah, same error as T154388 [01:29:45] T154388: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388 [01:31:39] runs puppet on hosts where puppet tried to git clone [01:31:50] when service status tells you "Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable." where can you find the logs? [01:32:39] what about the classic /var/log/logstash/ there?
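[Editor's note] For reference, the rendered log4j.properties fragment the ERB template should produce once log_host/log_port are set would look roughly like this. The RemoteHost value is quoted verbatim in the log; the port matches the one probed earlier; the appender class line is an assumption (the Port/RemoteHost property names match the standard log4j SocketAppender):

```properties
# Sketch of the rendered template output, not the full file.
log4j.appender.tcp=org.apache.log4j.net.SocketAppender
log4j.appender.tcp.Port=4560
log4j.appender.tcp.RemoteHost=logstash1002.eqiad.wmnet
```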
[01:32:57] i see some gzipped ones [01:33:01] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [01:33:12] who watches the watcher? [01:33:13] :) [01:34:53] I was just trying to see if logstash1003 had croaked from the same thing [01:35:12] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [01:35:12] and it seemed to show up in journalctl rather than the on disk logs [01:35:42] I see recent timestamps in /var/log/logstash on 1003 though so it's probably ok [01:36:00] the logstash cluster really needs an owner :/ [01:36:01] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:36:31] bd808: /run/log/journal/ [01:37:11] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [01:37:22] mutante: heh. those are some fancy filenames [01:37:32] "config file is located at /etc/systemd/journald.conf and it places journal files at /var/log/journal/[machine-id]/*.journal if it exists, otherwise it places them in /run/log/journal/[machine-id]/*.journal" [01:38:03] bd808 the error you described sounds very similar to https://github.com/elastic/logstash/issues/3811 [01:38:03] they are also binary noise [01:39:28] bd808: journalctl --file= ? [01:39:36] hmm [01:40:02] Takes a file glob as an argument. 
If specified, journalctl will operate on the specified journal files matching GLOB instead of the default runtime and system journal paths [01:41:27] yea, that seems to work [01:41:38] journalctl --file=system@14b8e69a54b4470686e86b9017165c36-00000000001bf1ed-000544dc20f23834.journal [01:41:50] gets the Dec 30 stuff [01:43:11] https://github.com/elastic/logstash/issues/4054#issuecomment-149382295 [01:48:11] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1800.731944 Seconds [01:48:21] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1806.479707 Seconds [01:48:21] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1806.491425 Seconds [01:48:26] (03PS1) 10Zhuyifei1999: Unassign 'transcode-reset' from Commons autoconfirmed and sysop groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330835 (https://phabricator.wikimedia.org/T154733) [01:49:11] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 25.720363 Seconds [01:49:21] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 31.5998 Seconds [01:49:21] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 31.598596 Seconds [01:50:52] (03CR) 10Krinkle: Add DB "shard" column to logstash log entries for labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330612 (owner: 10Aaron Schulz) [01:53:23] Krinkle: i dont know what this means but maybe a change in misc varnish "upstream prematurely closed connection while reading response header from upstream" [01:54:00] a ticket would be good [01:54:14] sorry, earlier we were in the middle of gerrit upgrade and also netsplit [01:56:09] also i think analytics-ops working on replacing it [02:14:37] 06Operations, 13Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#2922293 (10Dzahn) I moved the eqiad Ganglia aggregator from carbon to install1001 today. 
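[Editor's note] The rotated-journal detective work above condenses to a short sketch. The directory logic follows the journald.conf behaviour quoted earlier (persistent storage under /var/log/journal wins over the runtime tmpfs location); the .journal filename is the example from the log and will differ per host:

```shell
#!/bin/bash
# Locate where journald keeps its (rotated) journal files, mirroring the
# journald.conf rule quoted above: /var/log/journal/[machine-id]/ if that
# directory exists, otherwise /run/log/journal/[machine-id]/.
journal_dir=/run/log/journal
[ -d /var/log/journal ] && journal_dir=/var/log/journal
echo "journal dir: ${journal_dir}"

# Then replay a specific rotated file (example filename from the log):
# journalctl --file="${journal_dir}"/*/system@14b8e69a54b4470686e86b9017165c36-00000000001bf1ed-000544dc20f23834.journal
```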
This part is now unblocked :) [02:14:41] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:15:41] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [02:26:11] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [02:33:30] (03PS2) 10Andrew Bogott: Openstack: Forward some custom config changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/330626 [02:33:32] (03PS1) 10Andrew Bogott: Keystone: Add more uwsgi api processes [puppet] - 10https://gerrit.wikimedia.org/r/330837 [02:33:44] (03PS1) 10Andrew Bogott: Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 [02:34:43] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.7) (duration: 13m 23s) [02:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:05] (03CR) 10Andrew Bogott: [C: 032] Keystone: Add more uwsgi api processes [puppet] - 10https://gerrit.wikimedia.org/r/330837 (owner: 10Andrew Bogott) [02:53:55] _joe_: Ping? [02:54:08] elukey: You too, I guess. 
:) [02:54:25] https://commons.wikimedia.org/wiki/Commons:Village_pump#Temporary_change_of_user_group_rights [02:55:11] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [03:03:00] (03CR) 10Dzahn: [C: 032] icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) (owner: 10Dzahn) [03:03:07] (03PS9) 10Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) [03:03:34] Revent: it's a bad time, 4 AM in Europe [03:04:24] but they'll see it later i'm sure, yea [03:04:29] mutante: Yeah, it was just an FYI for the, [03:04:32] *them [03:04:35] yep [03:05:26] !log OS installation on elastic2025-elastic2036 [03:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:41] Revent: sounds good how admins can add themselves if needed [03:06:30] Yeah, the goal really is to prevent anyone who is ‘unaware’ from poking at the button. [03:07:48] My ‘draft’ of that message (I got feedback first) rather dumbly pointed out that doing so was effectively a way to execute an untrackable DOS attack. :/ [03:10:49] heh, well, it sounds good to me now.
thanks for doing that [03:22:33] volans: godog: https://gerrit.wikimedia.org/r/#/c/327686/ :) [03:23:01] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 784.00 seconds [03:27:16] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2922386 (10Papaul) [03:30:01] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 272.11 seconds [03:33:34] (03PS1) 10Dzahn: icinga: unbreak dependency cycle with Apache site and cert [puppet] - 10https://gerrit.wikimedia.org/r/330839 [03:37:22] (03PS2) 10Dzahn: icinga: unbreak dependency cycle with Apache site and cert [puppet] - 10https://gerrit.wikimedia.org/r/330839 [03:38:26] (03CR) 10Dzahn: [C: 032] icinga: unbreak dependency cycle with Apache site and cert [puppet] - 10https://gerrit.wikimedia.org/r/330839 (owner: 10Dzahn) [03:42:11] !log icinga - debugging issue with cert change [03:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:43:01] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Exec[acme-setup-acme-icinga] [03:56:39] (03PS1) 10Dzahn: icinga: Include challenge-apache.conf, exclude acme from proto redirect [puppet] - 10https://gerrit.wikimedia.org/r/330841 [03:57:01] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [03:59:54] (03CR) 10Dzahn: [C: 032] icinga: Include challenge-apache.conf, exclude acme from proto redirect [puppet] - 10https://gerrit.wikimedia.org/r/330841 (owner: 10Dzahn) [04:10:20] !log Icinga now using Letsencrypt cert and all good [04:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:44] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2922451 (10demon) 05Open>03Resolved a:03demon This rolled out today [04:13:01] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2922457 (10demon) [04:18:18] (03CR) 10Dzahn: "needed follow-up to include acme-challenge snippet and exclude the challenge URL from http->https redirect https://gerrit.wikimedia.org/r/" [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) (owner: 10Dzahn) [04:18:50] 06Operations, 10Traffic, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2922470 (10Dzahn) [04:19:08] 06Operations, 10Traffic, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (10Dzahn) Icinga switched to LE just now. 
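[Editor's note] The "exclude acme from proto redirect" follow-up above is the usual pattern for HTTP-01 validation: Letsencrypt must be able to fetch /.well-known/acme-challenge/ over plain HTTP, so that path has to be carved out of the site-wide HTTPS redirect. A sketch of that shape in Apache terms (assumed, not the actual puppet-managed vhost):

```apache
# Serve ACME HTTP-01 challenges over plain HTTP; redirect everything else.
RewriteEngine on
RewriteCond %{REQUEST_URI} !^/\.well-known/acme-challenge/
RewriteRule ^/(.*)$ https://%{SERVER_NAME}/$1 [R=301,L]
```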
[04:21:12] (03CR) 10Krinkle: build: require-dev phpunit in composer.json (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (owner: 10Krinkle) [04:24:21] (03PS2) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 [04:24:59] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (owner: 10Krinkle) [04:27:09] !log Started FlowFixInconsistentBoards.php (production mode) on all wikis [04:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:06] !log Finished FlowFixInconsistentBoards.php (production mode) on all wikis [04:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:12] (03PS3) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) [04:38:48] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:56:28] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:56:33] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:57:01] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:57:09] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:57:20] (03PS1) 10KartikMistry: Fix 
parameter alignment [puppet] - 10https://gerrit.wikimedia.org/r/330844 [04:58:36] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:59:07] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [05:00:47] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [05:01:41] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [05:18:32] (03PS4) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 [05:19:25] (03PS5) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 [05:20:08] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (owner: 10Krinkle) [05:24:20] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (owner: 10Krinkle) [05:32:31] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [05:59:32] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:28:01] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[06:28:11] PROBLEM - carbon-cache@b service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is failed [06:28:31] PROBLEM - carbon-cache@f service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is failed [06:35:11] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [06:49:11] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:49:33] (03PS2) 10Muehlenhoff: Puppetise yubikey-val (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/285962 [06:50:34] (03PS1) 10Urbanecm: Enable Extension:Babel's category on cswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330847 (https://phabricator.wikimedia.org/T67211) [06:54:31] RECOVERY - carbon-cache@f service on graphite1003 is OK: OK - carbon-cache@f is active [06:55:01] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [06:55:11] RECOVERY - carbon-cache@b service on graphite1003 is OK: OK - carbon-cache@b is active [06:55:31] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:57:11] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:57:21] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:02:11] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [07:04:21] RECOVERY - Check systemd state on restbase-test1001 
is OK: OK - running: The system is fully operational [07:07:21] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:08:31] !log installing crypto++ security updates on trusty hosts [07:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:22] (03CR) 10Dereckson: [C: 031] Unassign 'transcode-reset' from Commons autoconfirmed and sysop groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330835 (https://phabricator.wikimedia.org/T154733) (owner: 10Zhuyifei1999) [07:29:18] (03CR) 10Dereckson: [C: 031] Enable Extension:Babel's category on cswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330847 (https://phabricator.wikimedia.org/T67211) (owner: 10Urbanecm) [07:35:11] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:55:11] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:03:21] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:04:11] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [08:13:21] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:16:31] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [08:23:11] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:30:12] PROBLEM - Host auth2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:12] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100% [08:30:12] PROBLEM - Host acamar is DOWN: PING CRITICAL - Packet loss = 100% [08:30:12] PROBLEM - Host 2620:0:860:1:208:80:153:12 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:21] RECOVERY - Host auth2001 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [08:30:31] RECOVERY - Host acamar is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [08:30:31] PROBLEM - Host cp2004 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:31] PROBLEM - Host cp2006 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:31] PROBLEM - Host cp2005 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:51] PROBLEM - Host mc2005 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:51] PROBLEM - Host mc2006 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:51] PROBLEM - Host ms-fe2002 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:51] PROBLEM - Host ms-be2017 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:01] PROBLEM - Host mc2004 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:01] PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:21] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [08:31:21] RECOVERY - Host 2620:0:860:1:208:80:153:12 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [08:31:21] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [08:31:21] PROBLEM - Check systemd state on kafka2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:31:21] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:31:54] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka2002.codfw.wmnet because of too many down!: trendingedits_6699 - Could not depool server scb2004.codfw.wmnet because of too many down!: prometheus_80 - Could not depool server prometheus2001.codfw.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs2001.codfw.wmnet because of too many do [08:31:54] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 117, down: 1, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae2BR [08:32:41] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/etc/apache2/conf-available/50-server-status.conf],File[/etc/ganglia/conf.d/hhvm_mem.pyconf],File[/etc/ssh/userkeys/pybal-check] [08:33:21] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [08:33:21] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[screen],Package[jq],Package[zsh-beta] [08:33:34] PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 3744 threshold =0.1% breach: status: red, number_of_nodes: 24, unassigned_shards: 3648, number_of_pending_tasks: 3009, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3018, task_max_waiting_in_queue_millis: 130284, cluster_name: production-search-codfw, relocating_shards: 0, acti [08:33:41] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[htop],Package[tcpdump],Package[gdb],Package[lldpd] [08:35:31] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:31] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:31] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:31] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:31] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:32] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:35:32] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:35:41] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 
[08:35:41] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:42] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:35:42] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:43] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:43] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:44] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:36:01] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:01] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:01] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:01] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:01] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:01] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:01] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:02] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:02] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:03] PROBLEM - IPsec on mc1005 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2005_v4 [08:36:03] 
PROBLEM - IPsec on mc1006 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2006_v4 [08:36:04] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:21] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:21] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:31] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:31] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:36:31] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:36:31] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:38:00] Hi, Vasantrao Naik Govt. Institute Of Arts and Social Sciences - Nagpur event is facing problem unable to create user accounts and their IP address seems to be 117.211.27.103 [08:38:01] yeah, the only one we don't provide a range is the one bugged [08:39:21] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:40:05] I just got an alert on elastic search codfw shards. dcausse, could you check? I'm unable to at this point... [08:40:06] Not a high emergency, this is codfw... 
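The ElasticSearch shard alert being discussed here ("elasticsearch inactive shards ... threshold =0.1% breach") fires when the fraction of inactive shards exceeds a threshold. A minimal sketch of that logic, assuming field names from the Elasticsearch `_cluster/health` API; the helper name, threshold default, and sample numbers are illustrative, not the actual check's source:

```python
# Sketch of an inactive-shards threshold check over _cluster/health output.
# Field names follow the Elasticsearch cluster health API; everything else
# (function name, 0.1% default, sample numbers) is illustrative.

def inactive_shard_state(health: dict, threshold: float = 0.001) -> str:
    """Return 'OK' or 'CRITICAL' based on the share of inactive shards."""
    active = health["active_shards"]
    inactive = (
        health["unassigned_shards"]
        + health["initializing_shards"]
        + health["relocating_shards"]
    )
    total = active + inactive
    if total == 0:
        return "OK"
    return "CRITICAL" if inactive / total > threshold else "OK"

# Illustrative numbers in the spirit of the 08:33 alert: thousands of
# unassigned shards dwarf the 0.1% threshold.
alert = {
    "active_shards": 3744,
    "unassigned_shards": 3648,
    "initializing_shards": 96,
    "relocating_shards": 0,
}
print(inactive_shard_state(alert))  # CRITICAL
```

With half the shards inactive, the check goes red immediately; the later 11:29 RECOVERY corresponds to the fraction dropping back under the threshold as shards reassign.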
[08:40:32] gehel: sure will have a look [08:40:42] dcausse: gehel: seems to be a general problem with codfw/network, not specific to elastic [08:40:58] (03PS1) 10Giuseppe Lavagetto: Temporarily depool codfw [dns] - 10https://gerrit.wikimedia.org/r/330848 [08:42:22] (03CR) 10Giuseppe Lavagetto: [C: 032] Temporarily depool codfw [dns] - 10https://gerrit.wikimedia.org/r/330848 (owner: 10Giuseppe Lavagetto) [08:42:39] (03PS1) 10Dereckson: Adjust throttle rule for Maharashtra 'Edit Wikipedia' workshop (VNGIASS) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330849 (https://phabricator.wikimedia.org/T154312) [08:42:41] Hi. moritzm or gehel > could you deploy this? ^ [08:43:08] <_joe_> Dereckson: not now [08:43:21] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:43:37] (03PS1) 10Ema: Route around codfw, network issues there [puppet] - 10https://gerrit.wikimedia.org/r/330850 [08:43:59] _joe_: it's a current event (India timezone), participants are there, but they can't create account. [08:44:17] the change is an IP edit to a throttle rule to unblock the situation [08:44:54] <_joe_> Dereckson: we're in the middle of an outage [08:45:17] Misinterpreted the "08:40:06 < gehel> Not a high emergency, this is codfw..." [08:45:30] k [08:47:07] (03CR) 10Giuseppe Lavagetto: [C: 032] Route around codfw, network issues there [puppet] - 10https://gerrit.wikimedia.org/r/330850 (owner: 10Ema) [08:47:18] <_joe_> ema: should I merge it and run puppet in ulsfo? 
[08:47:21] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [08:47:21] RECOVERY - Check systemd state on kafka2001 is OK: OK - running: The system is fully operational [08:47:32] _joe_: yes please [08:48:21] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:59:41] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:00:21] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:00:41] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:01:21] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:04:21] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [09:05:50] (03CR) 10ArielGlenn: [C: 032] Adjust throttle rule for Maharashtra 'Edit Wikipedia' workshop (VNGIASS) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330849 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [09:06:07] thanks [09:07:21] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:07:33] it's not there yet [09:07:38] it's only merged [09:08:04] (03CR) 10jenkins-bot: Adjust throttle rule for Maharashtra 'Edit Wikipedia' workshop (VNGIASS) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330849 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [09:08:26] Yes, but thanks to take care of it. 
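The throttle change merged above adds a temporary exception so workshop participants behind a single IP can create accounts. The real rule lives in PHP in wmf-config/throttle.php; as a language-neutral sketch (the entry fields, limits, and time window here are hypothetical, only the IP comes from the log), the matching logic amounts to:

```python
from datetime import datetime
import ipaddress

# Hypothetical throttle-exception entry modeled on the VNGIASS request:
# inside the event window, the workshop IP gets a raised account-creation
# limit instead of the default. All values below are illustrative.
EXCEPTIONS = [
    {
        "range": "117.211.27.103/32",         # workshop IP from the log
        "start": datetime(2017, 1, 6, 0, 0),  # illustrative event window
        "end": datetime(2017, 1, 6, 23, 59),
        "value": 100,                         # raised creation limit
    }
]
DEFAULT_LIMIT = 6  # illustrative default per-IP limit

def account_creation_limit(ip: str, when: datetime) -> int:
    """Return the creation limit that applies to this IP at this time."""
    addr = ipaddress.ip_address(ip)
    for rule in EXCEPTIONS:
        in_range = addr in ipaddress.ip_network(rule["range"])
        in_window = rule["start"] <= when <= rule["end"]
        if in_range and in_window:
            return rule["value"]
    return DEFAULT_LIMIT
```

This is why the deploy was time-sensitive: until the rule is synced, the default limit keeps blocking the event's shared IP.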
[09:10:01] scap going now, in theory [09:12:32] !log ariel@tin Synchronized wmf-config/throttle.php: Adjust throttle rule for Maharashtra 'Edit Wikipedia' workshop (VNGIASS) (duration: 02m 46s) [09:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:35] sync-apaches: 98% (ok: 297; fail: 0; left: 6) [09:12:36] and waiting for a while [09:12:45] ah just completed [09:13:17] I reported to the task it's done, thanks. [09:13:26] fails on 'mw2119.codfw.wmnet', 'mw2215.codfw.wmnet', 'mw1201.eqiad.wmnet', 'mw2187.codfw.wmnet', 'mw1216.eqiad.wmnet' [09:13:30] yw [09:14:44] fails also on 'mw1211.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw1161.eqiad.wmnet' looking at the mw1* hosts [09:15:36] <_joe_> apergos: I guess we'll have to remove scap proxies that are in row A in codfw [09:16:08] if there are any more scaps before papaul comes on line [09:16:12] <_joe_> apergos: do that and re-run scap [09:16:39] <_joe_> apergos: can you look at it? [09:17:25] jouncebot: next [09:17:25] In 76 hour(s) and 42 minute(s): European Mid-day SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170109T1400) [09:17:51] _joe_, yes I'll take care of it [09:18:19] <_joe_> eu midday scap during the summit? [09:18:25] <_joe_> with almost no opsens around? [09:18:26] <_joe_> mh [09:18:34] <_joe_> anyways, no releases today [09:18:40] nope [09:23:57] _joe_: it was cancelled in greg's email iirc [09:24:00] !log asw-a7-codfw is down, serial console unresponsive [09:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:25] (03PS1) 10ArielGlenn: remove mw hosts in row A codfw from scap proxy list for now [puppet] - 10https://gerrit.wikimedia.org/r/330852 [09:26:15] uhh [09:26:16] why? [09:26:21] apergos: why? 
[09:26:49] so that we can have scap working for things like the above (throttle, etc) [09:27:02] paravoid: [09:27:03] row A codfw is working [09:27:07] it's just A7 that's down, no mw* there [09:27:14] ah a7 right [09:27:15] hm [09:27:18] that's one of the 10G switches [09:27:41] * Jan 9th: no train, SWATs only (but no one from RelEng is guaranteed to [09:27:44] * be around) (DevSummit+All Hands) [09:27:55] (03Abandoned) 10ArielGlenn: remove mw hosts in row A codfw from scap proxy list for now [puppet] - 10https://gerrit.wikimedia.org/r/330852 (owner: 10ArielGlenn) [09:28:26] I would be surprised if there are a bunch of swatters who line up for that day, but you never know [09:30:38] so are there scap issues? [09:31:11] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [09:32:10] there were fails on 8 hosts, all of which tried to connect back to one of mw2080 through mw2085 [09:35:02] why though? [09:35:16] (sorry writing the task at the same time) [09:35:50] (that's fine) [09:39:27] apergos: these are servers which are currently decommissioned [09:39:35] hey moritz [09:39:50] ah great, so unrelated [09:39:52] seems Rob didn't complete these fully: https://phabricator.wikimedia.org/T154621 [09:39:55] but why did the 8 servers try to ssh to them (or scp or whatever)? [09:40:15] they're still in conftool, but he already powered them down [09:41:02] here's the failure list again: 'mw2119.codfw.wmnet', 'mw2215.codfw.wmnet', 'mw1201.eqiad.wmnet', 'mw2187.codfw.wmnet', 'mw1216.eqiad.wmnet' 'mw1211.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw1161.eqiad.wmnet' [09:41:37] ah, not sure about those. I was referring to mw2080 through mw2085 [09:42:03] 06Operations, 10ops-codfw, 10netops: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2922945 (10faidon) [09:43:17] yeah, I got that. 
but you would think live servers would not refer to these mostly decommed ones for anything [09:44:12] they're also still listed in puppet, so all weird side effects can happen, such decoms should really be done in one piece... [09:46:05] sigh [09:48:33] 06Operations, 10Pybal, 10Traffic: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759#2922958 (10faidon) [09:49:48] 06Operations, 10ops-codfw, 10netops: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2922945 (10faidon) [09:53:31] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:53:57] ok that's weird but whatever: in spite of the whines from those 8 hosts, the file has in fact been synced there, I checked md5sum against the copy on tin :-/ [09:54:02] so calling it done [09:54:10] can you try another scap? [09:54:35] sure [09:55:49] same file I guess, as it's harmless? [09:56:09] paravoid: [09:56:18] yeah sure [09:59:01] so here we are at 98% complete again, I guess it's trying ssh transport and timing out on those mw20* hosts [09:59:04] we'll see in a minute [10:00:27] !log ariel@tin Synchronized wmf-config/throttle.php: test, noop (duration: 02m 45s) [10:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:29] ah [10:01:41] gah this is what I get for not having my morning cocoa [10:02:04] these whines from the 8 hosts: they were whines from the currently configured proxies trying to get to the mw208* hosts which are [10:02:08] half-decommissioned [10:02:15] so mystery solved [10:02:41] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:02:47] ah! 
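The verification described above ("I checked md5sum against the copy on tin") can be sketched as follows: comparing digests confirms a file really landed on a target even when scap reported failures for some proxies. The helper names and demo paths are illustrative:

```python
import hashlib
import os
import tempfile

def md5_of(path: str) -> str:
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def synced(deploy_copy: str, target_copy: str) -> bool:
    """True when the target's file matches the deployment master's copy."""
    return md5_of(deploy_copy) == md5_of(target_copy)

# Demo with two throwaway files standing in for tin's copy and a target's.
with tempfile.TemporaryDirectory() as d:
    a, b = os.path.join(d, "tin.php"), os.path.join(d, "mw.php")
    for p in (a, b):
        with open(p, "wb") as f:
            f.write(b"<?php // throttle config\n")
    print(synced(a, b))  # True: identical content, matching digests
```

A digest match is a stronger signal than scap's per-proxy exit status, which is why the failed-host "whines" could be set aside once the checksums agreed.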
[10:02:51] great, thanks [10:02:55] sorry for the noise [10:03:03] * apergos goes to get milk for cocoa :-D [10:03:04] no it's alright, useful to know [10:05:58] moritzm: did you follow up on the task about these half-decom hosts already? [10:06:45] not yet, will add a note now [10:08:17] thanks [10:14:52] (03PS2) 10DCausse: elasticsearch: tuning of zen discovery settings [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T154765) (owner: 10Gehel) [10:15:11] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: tuning of zen discovery settings [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T154765) (owner: 10Gehel) [10:16:28] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2918062 (10MoritzMuehlenhoff) This morning a deployment by Ariel of a mw-config throttling change failed since scap tried to connect to mw2080-mw2085, which have been powered down... [10:21:31] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [10:28:47] 06Operations, 10Wikimedia-SVG-rendering: SVG fails to render properly due to several issues - https://phabricator.wikimedia.org/T46016#2923124 (10Aklapper) >>! In T46016#488898, @Aklapper wrote: >> 1—Failure to render masks >> Any masked object will disappear. Clipping works. > > Maybe related to http://bu... 
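The `Add NRPE check to monitor timesyncd` change (r/330854) tracked next in the log needs to map the daemon's state to a Nagios exit code. A minimal sketch of the decision, assuming `timedatectl show`-style KEY=value output: `NTPSynchronized` is a real systemd property, but the function, exit-code mapping, and sample output here are illustrative, not the actual plugin:

```python
# Nagios/NRPE plugin convention: exit code 0 = OK, 2 = CRITICAL.
OK, CRITICAL = 0, 2

def timesync_status(output: str) -> int:
    """Map `timedatectl show`-style KEY=value output to a Nagios exit code.

    Looks for the NTPSynchronized property; anything but 'yes' (including
    a missing property) is treated as CRITICAL.
    """
    props = dict(
        line.split("=", 1)
        for line in output.strip().splitlines()
        if "=" in line
    )
    return OK if props.get("NTPSynchronized") == "yes" else CRITICAL

# In a real plugin the output would come from running timedatectl via
# subprocess; hardcoded here for illustration.
sample = "NTP=yes\nNTPSynchronized=yes\nTimezone=UTC"
print(timesync_status(sample))  # 0
```

Failing closed on a missing or unparseable property is the safer default for a monitoring check: an unknown sync state should page, not pass.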
[10:29:31] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [10:31:12] (03PS1) 10Muehlenhoff: Add NRPE check to monitor timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) [10:31:41] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:33:05] 06Operations, 10Wikimedia-SVG-rendering: SVG fails to render properly due to several issues - https://phabricator.wikimedia.org/T46016#2923126 (10MoritzMuehlenhoff) We'll unfortunately not be able to easily upgrade to 2.41.0 at this point; librsvg started to implement parts of the code in Rust and Debian jessi... [10:41:03] (03PS1) 10Urbanecm: [throttle] Lift for 2017-01-10+cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330855 (https://phabricator.wikimedia.org/T154312) [10:43:54] (03CR) 10Volans: [C: 04-1] "I'm not convinced is the right approach, see my comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [10:45:21] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [10:46:11] 06Operations, 10ops-codfw, 10netops: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2923142 (10ema) The impact on varnish errors has been minimal. In codfw we've had two hiccups, one at 8:30 and another smaller one at 8:40 {F5241163} In ulsfo we had a small 503 spike at 8:51 {F5241168} We... [11:00:11] (03PS5) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:01:03] (03CR) 10jerkins-bot: [V: 04-1] Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [11:01:21] PROBLEM - puppet last run on poolcounter1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:02:18] (03PS6) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:04:21] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [11:07:21] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:11:23] (03PS7) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:16:20] (03CR) 10Ema: Add NRPE check to monitor timesyncd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [11:18:08] (03Draft1) 10Paladox: Redirect /changes/ to /r/changes/ [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) [11:18:12] (03Draft2) 10Paladox: Redirect /changes/ to /r/changes/ [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) [11:22:55] (03PS2) 10Muehlenhoff: Add NRPE check to monitor timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) [11:29:35] RECOVERY - ElasticSearch health check for shards on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 24, unassigned_shards: 802, number_of_pending_tasks: 904, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3027, task_max_waiting_in_queue_millis: 192913, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_nu [11:29:43] (03PS8) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:30:00] (03PS2) 10Urbanecm: [throttle] Lift for 2017-01-10/12 
+ minor cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330855 (https://phabricator.wikimedia.org/T154312) [11:31:22] RECOVERY - puppet last run on poolcounter1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [11:31:49] (03PS1) 10Muehlenhoff: Drop reference to the manpage (not available on jessie) [puppet] - 10https://gerrit.wikimedia.org/r/330860 [11:33:17] (03CR) 10Muehlenhoff: [C: 032] Drop reference to the manpage (not available on jessie) [puppet] - 10https://gerrit.wikimedia.org/r/330860 (owner: 10Muehlenhoff) [11:34:46] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [11:46:34] (03PS3) 10Muehlenhoff: Add NRPE check to monitor timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) [11:46:41] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:50:45] (03CR) 10Muehlenhoff: [C: 032] Add NRPE check to monitor timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [11:52:06] (03PS1) 10Ema: systemd-timesyncd: fix config file [puppet] - 10https://gerrit.wikimedia.org/r/330863 (https://phabricator.wikimedia.org/T150257) [12:04:44] (03CR) 10Muehlenhoff: [C: 031] "Good catch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/330863 (https://phabricator.wikimedia.org/T150257) (owner: 10Ema) [12:08:25] (03PS2) 10Ema: systemd-timesyncd: fix config file [puppet] - 10https://gerrit.wikimedia.org/r/330863 (https://phabricator.wikimedia.org/T150257) [12:08:31] (03CR) 10Ema: [V: 032 C: 032] systemd-timesyncd: fix config file [puppet] - 10https://gerrit.wikimedia.org/r/330863 (https://phabricator.wikimedia.org/T150257) (owner: 10Ema) [12:10:44] 06Operations, 06Performance-Team, 10Thumbor: Implement rate limiter in Thumbor - https://phabricator.wikimedia.org/T151067#2923213 (10Gilles) [12:10:46] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066#2923214 (10Gilles) [12:12:34] (03PS1) 10Muehlenhoff: Switch cache servers in ulsfo to timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330865 (https://phabricator.wikimedia.org/T150257) [12:14:44] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:20:52] (03PS1) 10Gilles: Upgrade to 0.1.32 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/330866 [12:22:20] (03PS1) 10Gilles: Add new mandatory config value for SVG engine in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330867 (https://phabricator.wikimedia.org/T150754) [12:22:30] (03PS3) 10Muehlenhoff: Switch swift in esams to systemd-timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330404 (https://phabricator.wikimedia.org/T150257) [12:24:25] (03CR) 10Ema: [C: 031] Switch cache servers in ulsfo to timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330865 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [12:29:42] (03PS1) 10Gilles: Switch Thumbor to swift loader [puppet] - 10https://gerrit.wikimedia.org/r/330869 (https://phabricator.wikimedia.org/T151441) [12:31:38] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 
failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [12:41:43] (03CR) 10Muehlenhoff: "The admin group for datacenter ops is currently bound to the "salt::master::production" role via hieradata/role/common/salt/masters/produc" [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:58:36] (03PS3) 10Paladox: Redirect /changes/ to /r/changes/ [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) [12:59:38] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:00:31] (03CR) 10Hashar: [C: 04-1] "Gerrit is exposed under /r/ the issue reported in the task is due to git-review." [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) (owner: 10Paladox) [13:33:34] (03PS4) 10Hashar: build: update rubocop to 0.39 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 [13:33:37] (03CR) 10Hashar: [C: 04-1] build: update rubocop to 0.39 and tweak config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [13:33:52] (03CR) 10jerkins-bot: [V: 04-1] build: update rubocop to 0.39 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [13:34:11] (03PS1) 10Muehlenhoff: Remove access credentials for asherman [puppet] - 10https://gerrit.wikimedia.org/r/330885 (https://phabricator.wikimedia.org/T152957) [13:37:55] (03PS5) 10Hashar: build: update rubocop to 0.39 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 [13:38:15] (03CR) 10Hashar: "rebased / cleared out unrelated upgrades in Gemfile.lock" [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [13:39:48] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. 
Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [13:39:52] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for asherman [puppet] - 10https://gerrit.wikimedia.org/r/330885 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [13:51:48] !log reedy@tin Started scap: Rebuild message cache for Echo api messages being missing T154110 [13:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:52] T154110: Echo Api Messages missing - https://phabricator.wikimedia.org/T154110 [13:53:00] (03PS1) 10Muehlenhoff: Remove access credentials for laner [puppet] - 10https://gerrit.wikimedia.org/r/330891 (https://phabricator.wikimedia.org/T152957) [13:55:28] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:01:47] (03CR) 10Mark Bergsma: [C: 031] Remove access credentials for laner [puppet] - 10https://gerrit.wikimedia.org/r/330891 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [14:07:48] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:09:03] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for laner [puppet] - 10https://gerrit.wikimedia.org/r/330891 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [14:10:18] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:28] PROBLEM - Nginx local proxy to apache on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:08] PROBLEM - HHVM rendering on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:03] 06Operations, 10ops-codfw, 10netops, 05Security: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2923393 (10Reedy) [14:16:48] !log reedy@tin Finished scap: Rebuild message 
cache for Echo api messages being missing T154110 (duration: 25m 00s) [14:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:53] T154110: Echo Api Messages missing - https://phabricator.wikimedia.org/T154110 [14:22:13] (03PS1) 10Muehlenhoff: Remove access credentials for declerambaul [puppet] - 10https://gerrit.wikimedia.org/r/330903 (https://phabricator.wikimedia.org/T152957) [14:24:28] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:24:35] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for declerambaul [puppet] - 10https://gerrit.wikimedia.org/r/330903 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [14:34:18] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [14:34:47] (03PS1) 10Muehlenhoff: Remove access credentials for srijan [puppet] - 10https://gerrit.wikimedia.org/r/330904 (https://phabricator.wikimedia.org/T152957) [14:36:32] (03PS2) 10Tim Landscheidt: icinga: Indent @ssl_settings in Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/329739 [14:37:18] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:39:59] (03CR) 10Volans: "For the main issue see my reply inline. I'm leaving the other 2 minor comments as is." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [14:42:05] (03PS9) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [14:44:45] (03CR) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [14:48:39] (03CR) 10Chad: [C: 04-1] "I don't like this, it's basically doing a workaround for broken git-review behavior. We should get upstream to fix git-review instead (or " [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) (owner: 10Paladox) [14:49:42] (03PS2) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) [14:52:04] (03CR) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [14:52:18] PROBLEM - configured eth on lvs2006 is CRITICAL: eth1 reporting no carrier. [14:52:28] PROBLEM - configured eth on lvs2005 is CRITICAL: eth1 reporting no carrier. [14:52:38] PROBLEM - configured eth on lvs2004 is CRITICAL: eth1 reporting no carrier. [14:52:50] mark: related to the reboot? 
^^^ [14:53:02] likely yes [14:53:28] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for srijan [puppet] - 10https://gerrit.wikimedia.org/r/330904 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [14:53:28] !log papaul powercycled asw-a7-codfw 14:50 [14:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:18] RECOVERY - Host cp2006 is UP: PING WARNING - Packet loss = 54%, RTA = 36.06 ms [14:55:18] RECOVERY - IPsec on mc1004 is OK: Strongswan OK - 1 ESP OK [14:55:18] RECOVERY - Host mc2004 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [14:55:18] RECOVERY - Host cp2004 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [14:55:18] RECOVERY - Host mc2006 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [14:55:18] RECOVERY - Host ms-fe2002 is UP: PING OK - Packet loss = 0%, RTA = 37.33 ms [14:55:19] RECOVERY - Host ms-be2017 is UP: PING OK - Packet loss = 16%, RTA = 36.17 ms [14:55:19] RECOVERY - Host cp2005 is UP: PING OK - Packet loss = 16%, RTA = 36.20 ms [14:55:20] RECOVERY - configured eth on lvs2006 is OK: OK - interfaces up [14:55:28] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [14:55:28] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [14:55:28] RECOVERY - configured eth on lvs2005 is OK: OK - interfaces up [14:55:38] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [14:55:38] RECOVERY - configured eth on lvs2004 is OK: OK - interfaces up [14:55:38] RECOVERY - Host mc2005 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [14:55:38] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [14:55:38] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [14:55:38] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [14:55:39] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [14:55:39] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [14:55:40] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [14:55:40] RECOVERY 
- IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [14:55:41] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [14:55:41] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [14:55:42] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [14:55:42] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [14:56:08] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [14:56:08] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [14:56:08] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [14:56:08] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [14:56:08] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [14:56:08] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [14:56:09] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [14:56:09] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [14:56:10] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [14:56:10] RECOVERY - IPsec on mc1005 is OK: Strongswan OK - 1 ESP OK [14:56:11] RECOVERY - IPsec on mc1006 is OK: Strongswan OK - 1 ESP OK [14:56:11] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [14:56:12] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [14:56:12] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [14:57:23] (03PS10) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [14:57:28] PROBLEM - Freshness of OCSP Stapling files on cp2006 is CRITICAL: CRITICAL: File /var/cache/ocsp/globalsign-2016-rsa-unified.ocsp is more than 18300 secs old! [14:57:38] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:38] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:57:48] PROBLEM - Freshness of OCSP Stapling files on cp2005 is CRITICAL: CRITICAL: File /var/cache/ocsp/globalsign-2016-rsa-unified.ocsp is more than 18300 secs old! [14:57:48] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:48] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:48] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:48] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:49] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:49] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[htop],Package[tcpdump],Package[gdb],Package[lldpd] [14:58:48] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [14:59:48] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:59:55] (03PS1) 10Muehlenhoff: Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/330914 [15:00:58] (03CR) 10Muehlenhoff: [C: 032] Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/330914 (owner: 10Muehlenhoff) [15:01:38] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:01:48] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:03:28] RECOVERY - Freshness of OCSP Stapling files on cp2006 is OK: OK [15:03:38] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:03:48] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:07:14] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2923502 (10Cmjohnson) [15:07:36] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2923503 (10Papaul) mw2079 ge-4/0/38 mw2080 ge-3/0/0 mw2081 ge-3/0/1 mw2082 ge-3/0/2 mw2083 ge-3/0/3 mw2084 ge-3/0/4 mw2085 ge-3/0/5 mw2086 ge-3/0/6 mw2087 ge-3/0/7 mw20... 
[15:07:38] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:14:01] 06Operations, 10ops-eqiad: update label/racktables visible label for thumbor100[12] - https://phabricator.wikimedia.org/T153965#2923511 (10Cmjohnson) 05Open>03Resolved [15:14:28] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:15:40] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:16:14] 06Operations, 10ops-eqiad: update label/racktables visible label for labservices1002/WMF4075 - https://phabricator.wikimedia.org/T153967#2923515 (10Cmjohnson) 05Open>03Resolved [15:19:48] RECOVERY - Freshness of OCSP Stapling files on cp2005 is OK: OK [15:21:48] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:22:28] !log elastic2025-elastic2036 - signing puppet certs, salt-key, initial run [15:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:48] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:25:48] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:26:29] 06Operations, 06Operations-Software-Development: Puppet compiler: order resources for easy comparison between hosts - https://phabricator.wikimedia.org/T154776#2923517 (10Volans) [15:28:08] 06Operations, 10ops-eqiad: Rack and setup wdqs1003 - https://phabricator.wikimedia.org/T153349#2923531 (10Cmjohnson) [15:28:12] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2923533 (10Cmjohnson) [15:29:12] (03CR) 10Volans: "I've fixed a bunch of issues and now the puppet compiler 
results seems ok to me." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:29:25] !log powering off mw1239 to reseat DIMM [15:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:54] (03PS3) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) [15:35:31] thanks cmjohnson1! [15:36:29] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2923539 (10Cmjohnson) @elukey DIMM A1 swapped with B1. Let's see what happens [15:40:45] 06Operations, 10ops-codfw, 10netops, 05Security: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2923547 (10mark) The switch has been brought back up with a hard power cycle. We don't have a real indication yet of why it crashed, there's nothing concrete in the logs (local or otherwise),... [15:41:00] 06Operations, 10ops-codfw, 10netops, 05Security: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2923548 (10mark) p:05Unbreak!>03Normal [15:41:49] (03PS1) 10Cmjohnson: Removing final dns entries for mw1017 and mw1099 T151303 [dns] - 10https://gerrit.wikimedia.org/r/330918 [15:42:28] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:46:34] 06Operations: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2923553 (10Cmjohnson) [15:49:53] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test100[13] lost power redundancy - https://phabricator.wikimedia.org/T153248#2923554 (10Cmjohnson) @robh @fgiunchedi Both of the servers are running on one PSU. Both are very much out of warranty. Is there any action to take here?
[15:50:08] (03CR) 10Cmjohnson: [C: 032] Removing final dns entries for mw1017 and mw1099 T151303 [dns] - 10https://gerrit.wikimedia.org/r/330918 (owner: 10Cmjohnson) [15:50:53] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2923559 (10Cmjohnson) [15:50:55] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Hardware decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151303#2923557 (10Cmjohnson) 05Open>03Resolved Removed the remaining DNS entries. [15:51:00] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test100[13] lost power redundancy - https://phabricator.wikimedia.org/T153248#2923560 (10RobH) Are there any pending decom hosts unracked that have the same power supplies so we can steal them? [15:58:49] (03PS1) 10Papaul: Add elastic2025-elastic2036 Bug: T154251 [puppet] - 10https://gerrit.wikimedia.org/r/330923 (https://phabricator.wikimedia.org/T154251) [15:59:15] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (blocked): setup/install restbase-dev100[123] - https://phabricator.wikimedia.org/T151075#2923568 (10Cmjohnson) [15:59:17] 06Operations, 10ops-eqiad: Rename/relabel restbase-test1* to restbase-dev1* - https://phabricator.wikimedia.org/T154629#2923566 (10Cmjohnson) 05Open>03Resolved Switch ports, racktables updated...resolving [15:59:43] volans: can you please review and merge this [15:59:50] volans: https://gerrit.wikimedia.org/r/#/c/330923/ [15:59:53] volans: thanks [15:59:58] * volans looking [16:01:58] PROBLEM - Host 208.80.155.118 is DOWN: CRITICAL - Host Unreachable (208.80.155.118) [16:02:48] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.155.117) [16:03:46] ^ andrewbogott? [16:04:05] chasemp: ah yeah, just missed an alert to silence...
[16:04:18] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [16:04:57] (03CR) 10Volans: [C: 032] Add elastic2025-elastic2036 Bug: T154251 [puppet] - 10https://gerrit.wikimedia.org/r/330923 (https://phabricator.wikimedia.org/T154251) (owner: 10Papaul) [16:05:19] !log wiping codfw caches T154758 [16:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:52] T154758: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758 [16:06:59] (03PS3) 10Hashar: beta::autoupdater: Stop wmf-beta-mwconfig-update being a template just to get the staging dir [puppet] - 10https://gerrit.wikimedia.org/r/322408 (owner: 10Alex Monk) [16:07:09] (03PS1) 10Muehlenhoff: Update to 4.4.40 [debs/linux44] - 10https://gerrit.wikimedia.org/r/330926 [16:07:10] papaul: merged and puppet-merged [16:07:18] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:07:30] (03CR) 10Hashar: [C: 031] "I have cherry picked it on the beta cluster puppet master. Ran the Jenkins build and it seems all happy https://integration.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/322408 (owner: 10Alex Monk) [16:07:42] volans: thanks, will try again after I get some food [16:07:52] sure! :) [16:07:57] hope it works [16:08:59] volans: I just did one to be sure, yes it is working [16:09:03] volans: thanks [16:09:34] yw [16:09:38] PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - uploadlb_443 - Could not depool server cp2026.codfw.wmnet because of too many down! [16:09:48] PROBLEM - PyBal backends health check on lvs2005 is CRITICAL: PYBAL CRITICAL - uploadlb_443 - Could not depool server cp2008.codfw.wmnet because of too many down!: uploadlb6_443 - Could not depool server cp2026.codfw.wmnet because of too many down!
[16:10:00] ema: ^^^ [16:10:50] volans: that's probably because of the cache wipes, codfw is depooled anyways [16:11:04] I know it's depooled, just FYI [16:11:11] yeah thanks :) [16:11:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:11:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:11:43] uh [16:12:18] PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:18] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:19] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:19] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:19] PROBLEM - Swift HTTP frontend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:29] wat? [16:12:38] PROBLEM - Swift HTTP frontend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - swift_80 - Could not depool server ms-fe1003.eqiad.wmnet because of too many down! [16:12:44] PROBLEM - LVS HTTP IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:48] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - swift_80 - Could not depool server ms-fe1003.eqiad.wmnet because of too many down! [16:12:49] eqiad? 
[16:12:50] that doesn't look good [16:13:08] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:13:09] RECOVERY - Host 208.80.155.118 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:13:09] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [16:13:09] PROBLEM - Swift HTTP frontend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:09] PROBLEM - Swift HTTP frontend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:15] (03CR) 10Hashar: [C: 031] toollabs/CI: give banner scripts an .sh extension [puppet] - 10https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [16:13:19] though I can still view images [16:13:27] <_joe_> cached ones I guess [16:14:03] hmm [16:14:04] looking now too [16:14:14] yeah, there are issues [16:14:18] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [16:14:23] I tried some different sizes but I guess all the ones I happened to try were cached [16:14:40] a 322px version gives HTTP 503 [16:14:48] PROBLEM - Check HHVM threads for leakage on mw1295 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [16:14:55] !log Restarting Nodepool [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:01] <_joe_> something bad is happening and I'm not sure what [16:15:34] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2923595 (10Cmjohnson) @Andrew Replaced the thermal paste on labservices1001....it didn't look dry and crusty so not 100% it will fix the i... 
[16:15:40] Images are down https://commons.wikimedia.org/wiki/File:The_Adoration_of_the_Magi_(Matthias_Stom)_-_Nationalmuseum_-_18796.jpg [16:15:45] I merged a change yesterday to rewrite.py, it might be a side effect of that [16:15:47] <_joe_> paladox: we know [16:15:48] RECOVERY - Check HHVM threads for leakage on mw1295 is OK: OK [16:15:57] <_joe_> godog: no request is getting to the scalers [16:15:58] one I just tried said "Error creating thumbnail: File missing"? [16:15:59] oh ok [16:16:08] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.020 second response time [16:16:14] I get If you report this error to the Wikimedia System Administrators, please include the details below. [16:16:14] Request from 104.131.110.123 via cp1072 cp1072, Varnish XID 104333370 [16:16:14] Error: 503, Backend fetch failed at Fri, 06 Jan 2017 16:16:06 GMT [16:17:23] !log bounce swift-proxy on ms-fe100[123] leave ms-fe1004 for investigation [16:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:18] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:25] Although this PHP script (/w/index.php) exists, the file requested for output (mwstore://global-swift-eqiad/captcha-render/5/a/f/image_aa6f9790_5af00d48c7134951.png) does not. [16:21:37] guess because of the swift issues? [16:21:41] <_joe_> yes [16:22:07] ok, I'll wait until it's fixed.
[16:22:28] PROBLEM - Check size of conntrack table on ms-fe1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [16:24:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:25:08] PROBLEM - Check size of conntrack table on ms-fe1003 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [16:25:28] RECOVERY - Check size of conntrack table on ms-fe1002 is OK: OK: nf_conntrack is 44 % full [16:25:36] I'm temporarily bumping conntrack table sizes on ms-fe1* [16:27:03] RECOVERY - Check size of conntrack table on ms-fe1003 is OK: OK: nf_conntrack is 44 % full [16:28:33] PROBLEM - Elasticsearch HTTPS on elastic2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:28:54] ^ [16:29:03] is a new elastic node [16:29:13] PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [16:29:53] PROBLEM - puppet last run on elastic2025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 57 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [16:30:57] Hi [16:31:07] image issues ShakespeareFan00? [16:31:09] You probably already know this [16:31:15] yes [16:31:21] Someone in -commons also just complained [16:31:23] but I'm getting a backend error when accessing images... [16:31:24] fwiw [16:31:33] PROBLEM - Elasticsearch HTTPS on elastic2026 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2026.codfw.wmnet [16:31:49] dcausse, these elasticsearch https alerts in codfw nothing to worry about?
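An aside on the "nf_conntrack is 90 % full" alerts and the temporary bump logged at 16:25: the check compares the current connection-tracking entry count against the table's maximum. A minimal sketch of the arithmetic and the usual short-term remediation — the `conntrack_pct_full` helper is hypothetical, while `/proc/sys/net/netfilter/nf_conntrack_{count,max}` and the `net.netfilter.nf_conntrack_max` sysctl are the standard Linux interfaces:

```shell
# Percentage of the conntrack table in use, as a check like Icinga's would
# compute it (integer division of count*100 by max).
conntrack_pct_full() {
    count=$1; max=$2
    echo $(( count * 100 / max ))
}

conntrack_pct_full 235929 262144   # sample values: prints 89

# On the affected host, the live numbers come from procfs:
#   cat /proc/sys/net/netfilter/nf_conntrack_count
#   cat /proc/sys/net/netfilter/nf_conntrack_max
# Temporary bump (lost on reboot; persist via sysctl.d/puppet if it should stick):
#   sudo sysctl -w net.netfilter.nf_conntrack_max=524288
```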
[16:31:53] Not a major issue, but a pain when you need to do proof-reading on Wikisource which needs the scan images [16:32:07] Krenair: nope these are new nodes being racked [16:32:07] (03PS1) 10RobH: decom mw2080-2085 [puppet] - 10https://gerrit.wikimedia.org/r/330930 [16:32:55] (03CR) 10RobH: [C: 032] decom mw2080-2085 [puppet] - 10https://gerrit.wikimedia.org/r/330930 (owner: 10RobH) [16:34:55] hashar: https://doc.wikimedia.org/puppet is nice! [16:35:43] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2923626 (10RobH) >>! In T154621#2923111, @MoritzMuehlenhoff wrote: > This morning a deployment by Ariel of a mw-config throttling change failed since scap tried to connect to mw20... [16:36:13] hashar: do you know why the class docstring shows here https://doc.wikimedia.org/puppet/puppet_classes/restbase.html, but not here https://doc.wikimedia.org/puppet/puppet_classes/cassandra.html? is it that extra new line before the class definition?
[16:36:28] bblack: friendly reminder that as you mentioned we will plan for this work to be done in q3: https://phabricator.wikimedia.org/T138027 [16:37:13] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [16:37:46] (03PS1) 10Giuseppe Lavagetto: Depooling ulsfo upload caches [dns] - 10https://gerrit.wikimedia.org/r/330931 [16:37:55] <_joe_> mark, ema ^^ [16:38:13] <_joe_> but esams and eqiad have the same issue [16:38:20] <_joe_> ganglia lost them [16:38:23] <_joe_> so maybe it's not that [16:38:28] not sure this is helpful really [16:42:30] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2923650 (10Papaul) [16:42:32] 06Operations, 10ops-codfw: update the label and racktables entry for gerrit2001/WMF6408 & install SSDs - https://phabricator.wikimedia.org/T152527#2923648 (10Papaul) 05Open>03Resolved Disks installation complete. [16:43:33] (03CR) 10Hashar: [C: 04-1] elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 (owner: 10Gehel) [16:44:57] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:45:15] w [16:46:34] 06Operations, 10ops-codfw: update the label and racktables entry for gerrit2001/WMF6408 & install SSDs - https://phabricator.wikimedia.org/T152527#2923676 (10demon) Awesome thanks! [16:46:48] (03CR) 10Jforrester: [C: 031] "We should get this done sooner rather than later to avoid T154110-like issues."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/328482 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [16:46:48] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 863 bytes in 0.348 second response time [16:47:24] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2923679 (10RobH) [16:50:26] (03PS1) 10RobH: decom mw2075-2089 [dns] - 10https://gerrit.wikimedia.org/r/330937 [16:50:41] Hey folks, I'm struggling to upload a patchset against a non-default branch for a repo in gerrit. I see that the puppet repo has more than one branch maintained so I was hoping that someone from operations could give me some pointers. [16:50:46] Ops puppet branches: https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet,branches [16:51:01] ORES wheels branches: https://gerrit.wikimedia.org/r/#/admin/projects/research/ores/wheels,branches [16:51:03] (03CR) 10RobH: [C: 032] decom mw2075-2089 [dns] - 10https://gerrit.wikimedia.org/r/330937 (owner: 10RobH) [16:51:43] When I try to "git review -R wmflabs", I get http://pastebin.ca/3753822 [16:52:33] My local branch (called update_libraries) is based on the wmflabs branch and has one additional commit. [16:53:07] halfak: outage reasoning and response has been in progress for a while, I imagine, fyi [16:53:48] chasemp, there's an outage in progress? [16:54:38] halfak: image serving is/was down and is/was under extreme duress yes (see topic) [16:54:44] gotcha. [16:54:57] Thanks for the heads up.
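On halfak's non-default-branch question above: when `git review -R <branch>` misbehaves, Gerrit's native push syntax is the usual fallback — a commit is submitted for review against branch X by pushing to `refs/for/X`. A sketch, where the `refspec_for` helper is hypothetical (just to make the mapping explicit) and the remote name `origin` is assumed to point at Gerrit:

```shell
# Gerrit turns a push to refs/for/<branch> into a new change (or a new
# patchset, if the commit carries a known Change-Id) targeting <branch>.
refspec_for() {
    printf 'HEAD:refs/for/%s\n' "$1"
}

refspec_for wmflabs   # prints HEAD:refs/for/wmflabs

# So from the local update_libraries branch (based on wmflabs), roughly:
#   git push origin "$(refspec_for wmflabs)"
# i.e. the classic  git push origin HEAD:refs/for/wmflabs
```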
[16:59:26] PROBLEM - Elasticsearch HTTPS on elastic2028 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2028.codfw.wmnet [16:59:46] PROBLEM - Elasticsearch HTTPS on elastic2034 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2034.codfw.wmnet [16:59:46] PROBLEM - puppet last run on elastic2036 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/lib/json-simple.jar],Package[elasticsearch/plugins] [16:59:56] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:16] PROBLEM - Elasticsearch HTTPS on elastic2027 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2027.codfw.wmnet [17:00:16] PROBLEM - puppet last run on elastic2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:00:16] RECOVERY - puppet last run on elastic2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:26] PROBLEM - Elasticsearch HTTPS on elastic2032 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2032.codfw.wmnet [17:00:26] PROBLEM - puppet last run on elastic2035 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/lib/json-simple.jar],Package[elasticsearch/plugins] [17:00:56] PROBLEM - puppet last run on elastic2028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:01:06] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:01:06] PROBLEM - Elasticsearch HTTPS on elastic2031 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2031.codfw.wmnet [17:01:06] PROBLEM - puppet last run on elastic2034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:01:09] gehel: dcausse ^ [17:01:26] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:36] PROBLEM - puppet last run on elastic2027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:01:47] chasemp: those are new hosts [17:01:49] chasemp: these are new nodes, is it possible to silence all icinga alerts for all elastic2025 and up? [17:01:56] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:56] PROBLEM - Elasticsearch HTTPS on elastic2030 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2030.codfw.wmnet [17:01:56] PROBLEM - puppet last run on elastic2032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:01:56] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
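A note on the recurring "failed to verify search.svc.codfw.wmnet against elastic20xx.codfw.wmnet" alerts above: they mean the node's TLS certificate does not (yet) carry the service name in its subjectAltName, which is expected while the hosts are still being provisioned. A hedged sketch of checking that by hand — `san_contains` is a hypothetical helper, and the `openssl x509 -ext` option assumes OpenSSL 1.1.1 or newer:

```shell
# Return success iff a comma-separated subjectAltName list contains DNS:<name>.
san_contains() {
    case ", $1," in
        *"DNS:$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# A freshly provisioned node whose cert only names itself fails the check:
san_contains "DNS:elastic2036.codfw.wmnet" "search.svc.codfw.wmnet" \
    || echo "service name missing from SAN"

# Fetching the real SAN list from a live node would look something like:
#   echo | openssl s_client -connect elastic2036.codfw.wmnet:9243 2>/dev/null \
#       | openssl x509 -noout -ext subjectAltName
```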
[17:02:06] PROBLEM - Elasticsearch HTTPS on elastic2036 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2036.codfw.wmnet [17:02:06] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational [17:02:14] papaul: if you're not around I'll silence them [17:02:26] dcausse: I don't know of a way to silence hosts before they show up like that, you gotta try to catch it afaik :) [17:02:36] PROBLEM - Elasticsearch HTTPS on elastic2029 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2029.codfw.wmnet [17:02:36] PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 24 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:02:40] but cool no worries, just pinging to make sure [17:02:46] sure [17:02:46] RECOVERY - puppet last run on elastic2036 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:02:46] PROBLEM - Elasticsearch HTTPS on elastic2035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:03:40] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2923787 (10Niharika) Hi @Shoichi is the translation work currently in progress? [17:03:48] thanks volans [17:05:06] 06Operations, 10ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2923788 (10RobH) a:05RobH>03Papaul [17:05:06] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:05:49] 06Operations, 10ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2918062 (10RobH) This task is now assigned to @papaul for the disk wipes. 
Once the disks are wiped and the systems are pulled from the racks, I'll remove their network port entries and... [17:05:57] chasemp, dcausse, papaul: I've set them in downtime until Monday around this time, increase it if needed [17:06:08] 06Operations, 10ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2923807 (10RobH) [17:06:20] volans: you're a scholar and a gentleman [17:06:20] volans: thanks! [17:06:47] :) [17:06:58] you're welcome, sir ;) [17:08:20] volans: thanks [17:09:16] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2923813 (10Papaul) [17:09:36] RECOVERY - puppet last run on elastic2031 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:09:36] RECOVERY - puppet last run on elastic2027 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:09:53] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2905389 (10Papaul) a:05Papaul>03Gehel @Gehel you can take over. [17:10:37] Any ETA on the image backend being up? [17:11:56] It's ready when it's fixed?
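On the downtime volans set above: with an Icinga 1 setup this is typically done by writing the documented `SCHEDULE_HOST_DOWNTIME` external command into the command file. A sketch — the `downtime_cmd` helper, the author/comment strings, and the command-file path are illustrative; the semicolon-separated field order is from the Icinga/Nagios external-command reference:

```shell
# [ts] SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
downtime_cmd() {    # usage: downtime_cmd host start_epoch end_epoch
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;0;volans;new hosts being provisioned\n' \
        "$2" "$1" "$2" "$3"
}

now=$(date +%s)
until_monday=$(( now + 3 * 24 * 3600 ))
for h in elastic2025 elastic2026 elastic2027; do   # ...and so on through elastic2036
    downtime_cmd "$h" "$now" "$until_monday"
done
# The output would be appended to the command file on the Icinga host, e.g.:
#   ... >> /var/lib/icinga/rw/icinga.cmd
```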
[17:12:16] RECOVERY - puppet last run on elastic2029 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:12:56] <_joe_> ShakespeareFan00: not really, sorry [17:13:16] andre__: Okay not a major concern [17:13:18] <_joe_> we're still trying to figure out how to mitigate this [17:13:27] Also saving text is really slow for me right now [17:13:40] And I had an https connection failure [17:14:50] <_joe_> ShakespeareFan00: related, sadly [17:15:06] <_joe_> but text is partly uncached but should work mostly fine [17:16:45] Also when I KNOW I've edited a page the changes aren't propagating - https://en.wikipedia.org/w/index.php?title=Special%3AWhatLinksHere&limit=500&hidelinks=1&target=Template%3Affdc&namespace= [17:16:55] containing links I KNOW I've removed. [17:17:02] (links or template inclusions) [17:17:28] job queue is not affected [17:17:38] but in any case, be patient [17:19:56] RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:21:56] RECOVERY - puppet last run on elastic2028 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:22:27] <_joe_> yeah an unpredicted event wiped out most of our caches [17:23:12] (03CR) 10Chad: Gerrit: Enable logstash in gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:23:16] RECOVERY - puppet last run on elastic2026 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:25:18] (03CR) 10Chad: Gerrit: Enable g1 gc as we now use java 8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [17:25:31] (03PS3) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [17:25:40] (03PS4) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] -
10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [17:26:00] (03CR) 10Chad: "useUnicode is ok. Do we want/need the connectionCollation change yet since we haven't adjusted the DB?" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [17:26:02] (03CR) 10Paladox: Gerrit: Enable logstash in gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:26:33] (slowly) [17:27:21] (03CR) 10Chad: "One minor nit, otherwise this is fine" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:27:31] (03CR) 10Chad: "We should go ahead with this." [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:27:57] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational [17:27:59] (03CR) 10Paladox: "> useUnicode is ok. Do we want/need the connectionCollation change" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [17:28:17] RECOVERY - puppet last run on elastic2034 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:28:27] RECOVERY - puppet last run on elastic2035 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:28:50] (03CR) 10Paladox: "Is there a date? Should we book a date for this as i think gerrit needs to be down for this." 
[puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:28:57] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:29:57] PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:29:58] (03CR) 10Paladox: "> Is there a date? Should we book a date for this as i think gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:30:31] (03CR) 10Chad: "I think this can be abandoned. Looking at the task, it looks like we'll be going in a different direction--looks like we'll allow a policy" [puppet] - 10https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: 10Paladox) [17:30:47] PROBLEM - Elasticsearch HTTPS on elastic2033 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2033.codfw.wmnet [17:30:57] (03CR) 10Paladox: Add support for searching gerrit using bug:T1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:31:11] (03Abandoned) 10Paladox: phabricator: allow mirroring from git.legoktm.com into Diffusion [puppet] - 10https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: 10Paladox) [17:31:29] (03CR) 10Chad: "Would need a date for the brief downtime for applying the config, applying the change and bringing back up. Shouldn't take more than 5-10 " [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:31:44] dcausse, papaul: I guess we have the SSL checks too :( [17:31:58] (03CR) 10Chad: [C: 031] "Silly gerrit. 
Fair enough :)" [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:32:15] there's no way to silence all alerts from a particular host? :/ [17:32:45] (03CR) 10Paladox: "Yep lol." [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:33:08] dcausse: I did, probably 2033 came up just now [17:33:17] it doesn't have the downtime [17:33:23] oh ok [17:33:48] (03PS5) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [17:34:18] dcausse: added there too [17:34:23] thanks :) [17:40:27] (03CR) 10Paladox: "> Would need a date for the brief downtime for applying the config," [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:42:45] (03CR) 10Paladox: "@Chad can this be merged? I disabled the host part so logstash won't be enabled by default."
[puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:43:34] (03PS6) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [17:45:58] (03CR) 10Paladox: "We will want to keep an eye out for a week to make sure this does not cause any side effects + make sure everyone's usernames are converted t" [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:47:35] (03PS11) 10Hashar: Modification of Rakefile spec entry point [puppet] - 10https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: 10Nicko) [17:48:09] (03PS1) 10Giuseppe Lavagetto: swift: move thumbs handling to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/330950 [17:48:50] (03Abandoned) 10Giuseppe Lavagetto: Depooling ulsfo upload caches [dns] - 10https://gerrit.wikimedia.org/r/330931 (owner: 10Giuseppe Lavagetto) [17:49:59] (03PS4) 10Hashar: Use task to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 [17:50:17] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [17:51:53] (03CR) 10jerkins-bot: [V: 04-1] Use task to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 (owner: 10Hashar) [17:54:08] (03PS21) 10Dzahn: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:56:08] (03CR) 10Tim Landscheidt: "> […]" [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: 10BryanDavis) [17:59:06] hi [18:00:09] (03CR) 10BBlack: [C: 031] swift: move thumbs handling to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/330950 (owner: 10Giuseppe Lavagetto) [18:00:55] (03Abandoned) 10Tim Landscheidt: WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [18:00:59] (03CR) 10Filippo Giunchedi: [C: 032] swift: move thumbs handling to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/330950 (owner: 10Giuseppe Lavagetto) [18:01:16] (03PS1) 10Dzahn: delete icinga SSL cert, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/330957 [18:01:26] (03Abandoned) 10Tim Landscheidt: Tools: Migrate from bigbrother to bigbrothermonitor [puppet] - 10https://gerrit.wikimedia.org/r/234051 (owner: 10Tim Landscheidt) [18:02:45] (03CR) 10Dzahn: "when doing this, the key should be deleted in private repo too" [puppet] - 10https://gerrit.wikimedia.org/r/330957 (owner: 10Dzahn) [18:03:27] (03PS3) 10Dzahn: icinga: Indent @ssl_settings in Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/329739 (owner: 10Tim Landscheidt) [18:03:47] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:04:27] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [18:04:42] (03CR) 10Dzahn: [C: 032] icinga: Indent @ssl_settings in Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/329739 (owner: 10Tim Landscheidt) [18:07:27] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:07:27] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923652 (10jcrespo) [18:08:01] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923652 (10jcrespo) @Aklapper The problems have been minimized, but this is still WIP. [18:08:23] (03PS1) 10Andrew Bogott: Puppetmaster: Remove remnants of ldap node definitions. 
[puppet] - 10https://gerrit.wikimedia.org/r/330959 (https://phabricator.wikimedia.org/T148781) [18:08:31] !log force puppet run on eqiad cache_upload to switch thumbs to codfw [18:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:52] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923953 (10jcrespo) [18:10:42] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923652 (10jcrespo) [18:10:47] RECOVERY - Swift HTTP frontend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.001 second response time [18:10:47] RECOVERY - Swift HTTP frontend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.001 second response time [18:10:47] RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.011 second response time [18:11:03] RECOVERY - LVS HTTP IPv4 on ms-fe.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.010 second response time [18:11:04] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [18:11:08] nice [18:11:14] * apergos crosses fingers [18:11:17] RECOVERY - Swift HTTP frontend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.001 second response time [18:11:17] RECOVERY - Swift HTTP frontend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.001 second response time [18:11:17] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.009 second response time [18:11:17] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.009 second response time [18:11:17] RECOVERY - Swift HTTP backend on 
ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.008 second response time [18:11:37] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [18:11:41] <_joe_> apergos: we just shifted some traffic [18:11:50] yep been lurking [18:12:03] let's hope codfw handles it smoothly [18:13:17] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.032 second response time [18:13:17] RECOVERY - Nginx local proxy to apache on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.044 second response time [18:13:17] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 75216 bytes in 0.124 second response time [18:13:25] !log mw1205 - restarted hhvm [18:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:08] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923982 (10Elisfkc) 05Open>03Resolved a:03Elisfkc Got upload wizard and Flickr2Common... [18:15:47] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [18:17:26] (03PS2) 10Andrew Bogott: Puppetmaster: Remove remnants of ldap node definitions. 
[puppet] - 10https://gerrit.wikimedia.org/r/330959 (https://phabricator.wikimedia.org/T148781) [18:19:17] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:22:06] (03PS2) 10Tim Landscheidt: Tools: Undo obsolete /var/mail customization [puppet] - 10https://gerrit.wikimedia.org/r/326306 [18:22:27] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:23:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:23:28] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:24:27] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:32:24] _joe_: Well, whatever you guys broke had a quite positive effect on videoscaler ‘load’, lol. [18:32:47] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:32:50] some uploads were failing for a while [18:33:03] probably helped things on the videoscalers [18:34:39] Krenair: Every queued transcode received a ‘transcode_time_error’ at about 201701061611 [18:34:40] (03PS4) 10Tim Landscheidt: Set SYS_UID_MAX and SYS_GID_MAX to 499 [puppet] - 10https://gerrit.wikimedia.org/r/326311 (https://phabricator.wikimedia.org/T45795) [18:35:23] they all timed out at the same time, when the issues started to occur? [18:36:06] Krenair: It appears to be the same behavior as when the scalers were rebooted last month.
[18:36:18] (in that they all got an error time) [18:36:29] https://quarry.wmflabs.org/query/14842 [18:36:52] (03CR) 10Andrew Bogott: [C: 032] Set SYS_UID_MAX and SYS_GID_MAX to 499 [puppet] - 10https://gerrit.wikimedia.org/r/326311 (https://phabricator.wikimedia.org/T45795) (owner: 10Tim Landscheidt) [18:39:02] (03PS2) 10Andrew Bogott: toollabs: remove host aliases for tools-exec-121[2-6] [puppet] - 10https://gerrit.wikimedia.org/r/330333 (https://phabricator.wikimedia.org/T154539) (owner: 10BryanDavis) [18:39:06] Revent, does it record the error somewhere? [18:39:23] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2924063 (10czar) Special:Upload is back up for me after being down for a while (T154790). I... [18:40:21] Krenair: There is an error field (that I am not displaying in that search) but it looks like they are all ‘source not found' [18:40:39] (03CR) 10Andrew Bogott: [C: 032] toollabs: remove host aliases for tools-exec-121[2-6] [puppet] - 10https://gerrit.wikimedia.org/r/330333 (https://phabricator.wikimedia.org/T154539) (owner: 10BryanDavis) [18:41:24] strange error [18:41:32] https://quarry.wmflabs.org/query/15278 [18:41:44] You can see it there on a couple of the transcodes. [18:41:46] but it's possible this is just what happened when Swift was overwhelmed [18:42:41] Krenair: TBH, if it ‘just happened’ to choose today to crap out everything in the queue… that’s not necessarily a bad thing.
[18:43:22] https://phabricator.wikimedia.org/T154733 <- because of what this is meant to address [18:43:53] I doubt it's unrelated [18:48:54] (03PS22) 10Dzahn: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [18:51:44] (03CR) 10Dzahn: [C: 032] Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [18:52:09] Revent, do you need to re-queue stuff? [18:52:36] Krenair: Eventually, yeah, but kinda waiting to watch what’s happening. [18:52:44] yep [18:52:50] (I can’t see the ‘real’ queue) [18:52:53] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [18:52:58] I imagine there'll be a report at some point [18:53:27] Krenair: Can you check what the backlog in the ‘real’ queue looks like? [18:53:46] I don't really know enough about the video scaling systems [18:53:47] sorry [18:54:20] (03PS2) 10Tim Landscheidt: Tools: Update list of host aliases for mail relay [puppet] - 10https://gerrit.wikimedia.org/r/326308 [18:54:51] (nods) Apparently, the ‘status’ of the file in the DB as ‘queued’ is not what controls them actually being run, there is an actual software queue somewhere. [18:55:13] probably job queue jobs I would guess [18:56:48] "webVideoTranscode: 0 queued; 5878 claimed (4660 active, 1218 abandoned); 0 delayed" [18:56:59] bd808: Awesome! [18:57:17] that's from "$ mwscript showJobs.php --wiki=commonswiki --group" [18:57:38] It’s that ‘0 queued’, in that the pile of huge crap people kept resetting is flushed out. :) [18:59:02] Krenair: And yes, I’m qould to has to requeue a couple of thousand, but… not at a rate that will keep the scalers from being able to handle new uploads.
[18:59:12] *going to have to requeue [18:59:19] !log gerrit restarting for config change 308753 - will be back in seconds [18:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] yep [19:00:06] thanks Revent [19:00:33] Yeah, it’s my life recently. (lol) :P [19:01:09] hashar: hi, did you want this one? https://gerrit.wikimedia.org/r/#/c/328051/ [19:02:39] (03PS3) 10Dzahn: toollabs/CI: give banner scripts an .sh extension [puppet] - 10https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) [19:02:55] mutante: haven't come across that one yet :( [19:03:04] hashar: no rush, later then [19:03:49] mutante: it is rather nasty change with a race condition between upstart and mysql data being on a tmpfs [19:03:52] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2924104 (10jcrespo) @czar There will be a post mortem, as usual, on https://wikitech.wikime... [19:06:50] hashar: yep, not now then. ack [19:07:21] !log gerrit: Started full reindex of all changes, should be background but will be watching [19:07:22] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2924114 (10czar) That's great but I meant that as a largely non-technical user, I didn't kn...
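bd808's `showJobs.php --group` output quoted above is a flat summary line per job type; pulling the counters out of it for a quick check is a one-liner. A sketch over the exact line quoted (the sed patterns are mine, not a supported MediaWiki interface):

```shell
# Extract queue counters from the "showJobs.php --group" line quoted above.
LINE="webVideoTranscode: 0 queued; 5878 claimed (4660 active, 1218 abandoned); 0 delayed"
# Number before "queued" (after the job-type prefix).
QUEUED=$(printf '%s\n' "$LINE" | sed -n 's/^[^:]*: \([0-9][0-9]*\) queued.*/\1/p')
# Number before "abandoned" inside the parenthesized breakdown.
ABANDONED=$(printf '%s\n' "$LINE" | sed -n 's/.*, \([0-9][0-9]*\) abandoned.*/\1/p')
echo "queued=$QUEUED abandoned=$ABANDONED"   # → queued=0 abandoned=1218
```

The `0 queued` / `1218 abandoned` split is exactly what the conversation turns on: nothing waiting, but a pile of abandoned transcodes that will need requeueing.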
[19:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:44] mutante: the rough idea is that when a machine boots, the mysql service should only start after a tmpfs has been created at /var/lib/mysql [19:07:51] so yeah need a bunch of testing [19:10:07] (03CR) 10Dzahn: [C: 032] toollabs/CI: give banner scripts an .sh extension [puppet] - 10https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [19:10:37] greg-g: hi… i'm gonna need to have something deployed today real quick :( [19:10:39] hashar: yea, agree! [19:10:41] (03PS3) 10Andrew Bogott: Openstack: Forward some custom config changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/330626 [19:11:46] (03CR) 10Andrew Bogott: [C: 032] Openstack: Forward some custom config changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/330626 (owner: 10Andrew Bogott) [19:12:11] (03CR) 10Hashar: [C: 04-1] "That is not going to fix the issue. The problem is we explicitly mark the mysql service to be started manually on boot, and whenever pupp" [puppet] - 10https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450) (owner: 10Paladox) [19:12:28] greg-g: https://phabricator.wikimedia.org/T154779 / https://gerrit.wikimedia.org/r/330974 [19:12:30] (03PS4) 10Dzahn: base: add lshw to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/328952 [19:12:39] (03PS2) 10Andrew Bogott: Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 [19:13:06] (03Abandoned) 10Paladox: Contint: notify service mysql on creation of mysql dir [puppet] - 10https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450) (owner: 10Paladox) [19:13:21] (03CR) 10Dzahn: "please re-add me when it's ready to go" [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [19:13:40] greg-g: we'd have noticed this problem yesterday if the train wasn't
delayed :( [19:15:03] (03CR) 10Dzahn: "please re-add me when it's time for this (needs the next scheduled maintenance window, right)" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [19:16:01] MatmaRex: looks like the sort of thing that should be ok for a Friday [19:16:08] MatmaRex: +2 [19:16:13] Start the backport process [19:16:22] thanks folks [19:17:02] !log restarting elasticsearch on relforge100[12] to test new search-ltr plugin [19:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:25] (03CR) 10Dzahn: [C: 032] base: add lshw to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/328952 (owner: 10Dzahn) [19:18:52] ostriches: bd808: wmf.7 backport is https://gerrit.wikimedia.org/r/#/c/330975/ [19:19:25] robh: ^ you will now get "lshw" everywhere (again) [19:19:27] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 165 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 162, number_of_pending_tasks: 9, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 196, task_max_waiting_in_queue_millis: 1076, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: [19:19:27] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 147 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 144, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 214, task_max_waiting_in_queue_millis: 427, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: [19:19:35] mutante: huzzah! [19:20:05] ugh, and right when i said.. 
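The boot-ordering race hashar describes earlier (mysqld must not start before the tmpfs is mounted at /var/lib/mysql) is what the abandoned change 328051 tried to fix. The CI slaves in question ran upstart, but on a systemd host the ordering would be a one-line drop-in; a sketch only, with a hypothetical drop-in path:

```ini
# /etc/systemd/system/mysql.service.d/tmpfs-order.conf  (hypothetical drop-in)
[Unit]
# systemd generates var-lib-mysql.mount from the tmpfs entry in /etc/fstab;
# RequiresMountsFor adds Requires= and After= on that mount unit, so mysqld
# can only start once the tmpfs is actually mounted.
RequiresMountsFor=/var/lib/mysql
```

After a `systemctl daemon-reload`, a boot can no longer start mysqld against the bare, unmounted /var/lib/mysql directory.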
[19:20:27] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 361, initial [19:20:27] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 361, initial [19:20:32] ^this is just the restart, right? [19:20:34] (03PS1) 10Dzahn: Revert "base: add lshw to standard packages" [puppet] - 10https://gerrit.wikimedia.org/r/330976 [19:21:02] what the heck "E: Unable to locate package lswh [19:21:11] but manually it's totally there [19:21:18] and installs without any problems [19:21:37] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:21:47] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:21:47] mutante lshw [19:21:50] (03CR) 10Dzahn: [C: 032] Revert "base: add lshw to standard packages" [puppet] - 10https://gerrit.wikimedia.org/r/330976 (owner: 10Dzahn) [19:21:54] sigh [19:21:57] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[lswh] [19:21:57] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:06] what a stupid typo [19:22:13] lol [19:22:16] what is that, BTW? [19:22:17] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh],Etcd_user[root] [19:22:28] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:29] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[lswh] [19:22:29] PROBLEM - puppet last run on bohrium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:30] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:30] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:31] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:36] oh, ls hw [19:22:46] I got confused with the other command [19:22:47] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:47] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:47] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:47] PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:48] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:48] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[lswh] [19:22:48] jynus it was a typo [19:22:49] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:49] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:50] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:23:05] jynus: yea, to list hardware details mostly for dc-ops [19:23:14] yes, I know it [19:23:32] it was the typo one that got me confused [19:24:05] yea :/ [19:24:32] i'm getting the bot back asap [19:24:40] just to avoid the spam [19:24:55] (03CR) 10Paladox: "probably want to create a follow up with the typo fixed :)" [puppet] - 10https://gerrit.wikimedia.org/r/330976 (owner: 10Dzahn) [19:25:13] (03CR) 10Andrew Bogott: [C: 032] Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 (owner: 10Andrew Bogott) [19:25:26] (03PS3) 10Andrew Bogott: Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 [19:25:53] (03PS1) 10Dzahn: Revert "Revert "base: add lshw to standard packages"" [puppet] - 10https://gerrit.wikimedia.org/r/330979 [19:26:18] ostriches: you'll be deploying that, right? 
[19:26:39] I can ya [19:26:57] thanks [19:27:46] (03PS2) 10Dzahn: Revert "Revert "base: add lshw to standard packages"" [puppet] - 10https://gerrit.wikimedia.org/r/330979 [19:29:22] !log demon@tin Synchronized php-1.29.0-wmf.7/extensions/UploadWizard/resources/mw.UploadWizard.js: I32e0b8f81ca2a2e9ffc0c3a379921e12465815f2 (duration: 00m 59s) [19:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:02] (03CR) 10Dzahn: [C: 032] Revert "Revert "base: add lshw to standard packages"" [puppet] - 10https://gerrit.wikimedia.org/r/330979 (owner: 10Dzahn) [19:30:06] MatmaRex: All done [19:33:25] ostriches: hmm, i'm still seeing the old code [19:33:31] Cache? [19:33:51] probably not [19:33:52] Commit is showing on tin [19:33:56] "Ignore 'bad-prefix' warning on the Upload step" [19:34:00] ostriches: hmm, you synced only that one file? it's the wrong file :D [19:34:06] Blah [19:34:08] Whoops [19:34:15] resources/mw.UploadWizardUpload.js [19:34:49] Whole dir this time :p [19:35:14] !log demon@tin Synchronized php-1.29.0-wmf.7/extensions/UploadWizard/resources: I32e0b8f81ca2a2e9ffc0c3a379921e12465815f2 (duration: 00m 40s) [19:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:29] (03CR) 10Chad: [C: 031] "Then lets go ahead with this." [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [19:36:38] (03PS4) 10Andrew Bogott: Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 [19:37:14] (03CR) 10Dzahn: "i dunno, something about www-data running ssh gives me a bad feeling here. 
adding Moritz for a second opinion" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [19:38:08] ostriches: ok, this is ridiculous, but i am still seeing the old code [19:38:16] i am looking at https://commons.wikimedia.org/w/extensions/UploadWizard/resources/mw.UploadWizardUpload.js [19:38:27] there's no 'bad-prefix' in this file [19:38:31] I see new [19:38:33] Local cache? [19:38:43] curl https://commons.wikimedia.org/w/extensions/UploadWizard/resources/mw.UploadWizardUpload.js | grep bad-prefix [19:39:05] case 'bad-prefix': [19:39:05] // we ignore these warnings, because the title is not our final title. [19:39:09] (03CR) 10Dzahn: "the puppet part looks fine, as long as running purgeUnusedProjects.php itself is harmless on the cluster" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [19:39:10] ^ I see it [19:40:02] (03CR) 10Dzahn: "maybe change the commit message. now it says "enable" but doesn't actually enable." [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [19:40:18] ostriches: i don't. do you want to see the curl -I? [19:40:39] i'm in europe, so i could be seeing something else than you [19:40:56] oooh wait, there it is. finally [19:41:06] :) [19:41:30] (03CR) 10Dzahn: [C: 04-1] "i think we aren't ready for this yet, as it broke when we tried last time, and meanwhile we have rsynced repos over instead" [puppet] - 10https://gerrit.wikimedia.org/r/324851 (https://phabricator.wikimedia.org/T137928) (owner: 1020after4) [19:41:33] (03CR) 1020after4: [C: 031] "I agree that it isn't ideal, however, it's required in order for phabricator repository clustering. The reason this is desired is that clu" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [19:41:33] so… we cache .js files for an extra five minutes?
grumble [19:41:36] ostriches: thanks <3 [19:42:32] yw [19:42:41] (03CR) 1020after4: [C: 04-1] "Yeah this is still waiting for me to deploy upstream changes and then it still needs some more thorough testing." [puppet] - 10https://gerrit.wikimedia.org/r/324851 (https://phabricator.wikimedia.org/T137928) (owner: 1020after4) [19:43:01] mutante: I've been periodically running the script manually from terbium, so it's well tested :) [19:46:08] kaldari: ok, fair enough :) doing! [19:46:15] (03CR) 10Dzahn: [C: 032] Add cron job for PageAssessments maintenance script to puppet [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [19:46:29] (03PS7) 10Dzahn: Add cron job for PageAssessments maintenance script to puppet [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [19:47:03] mutante: Thanks! [19:47:38] it's also quite fast [19:48:10] (03PS1) 10Urbanecm: Enable import from cswiki to arbcom_cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330983 (https://phabricator.wikimedia.org/T154799) [19:49:43] kaldari: we need to add a line in Hiera to make sure it's only enabled in eqiad, and not both codfw and eqiad [19:50:05] eh... 
or that's what i thought [19:50:24] double checks [19:54:01] (03PS6) 10Aaron Schulz: Add DB "shard" column to logstash log entries for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330612 [19:54:20] (03PS7) 10Aaron Schulz: Add DB "shard" column to logstash log entries for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330612 [19:54:51] (03PS1) 10Dzahn: mediawiki: activate page assessments cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/330984 [19:54:59] kaldari: ^ it was missing that [19:55:48] (03PS2) 10Dzahn: mediawiki: activate page assessments cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/330984 [19:56:24] (03CR) 10Dzahn: [C: 032] mediawiki: activate page assessments cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/330984 (owner: 10Dzahn) [19:56:55] Thanks! [19:56:57] PROBLEM - nova-api http on labnet1002 is CRITICAL: connect to address 10.64.20.25 and port 8774: Connection refused [19:59:47] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:59:52] kaldari: done, exists on terbium now (but not on wasat) [20:00:41] mutante: what's wasat? [20:00:53] (03PS5) 10Chad: Remove MWVersion, fold its two functions into MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 [20:00:55] (03PS1) 10Chad: Remove w/MWVersion.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 [20:01:01] kaldari: the maintenance server in codfw. 
if the datacenters switch over that replaces terbium [20:01:15] in that case we'll be able to just make the switch in Hiera [20:01:28] that will deactivate the cron in eqiad, and activate it in codfw [20:01:39] got it [20:01:42] (same with all the other maint crons) [20:02:26] Krinkle: https://gerrit.wikimedia.org/r/#/c/330612/ [20:02:35] (03PS3) 10Chad: MWMultiversion cleanups [puppet] - 10https://gerrit.wikimedia.org/r/309366 [20:02:42] (03PS3) 10Thcipriani: Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) [20:03:47] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [20:04:25] the puppet spam because the lshw typo is over. lunch break [20:04:57] (03CR) 10jerkins-bot: [V: 04-1] Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) (owner: 10Thcipriani) [20:06:45] (03CR) 10Chad: [C: 032] Remove w/MWVersion.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [20:07:16] (03Merged) 10jenkins-bot: Remove w/MWVersion.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [20:07:37] (03CR) 10jenkins-bot: Remove w/MWVersion.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [20:09:07] PROBLEM - nova-api http on labnet1001 is CRITICAL: connect to address 10.64.20.13 and port 8774: Connection refused [20:09:43] (03PS4) 10Thcipriani: Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) [20:11:07] RECOVERY - nova-api http on labnet1001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.003 second response 
time [20:12:25] (03PS1) 10Andrew Bogott: Nova: Fix syntax mistake in nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/330996 [20:12:27] (03CR) 10ArielGlenn: [C: 031] "The changes to runphpscriptlet lgtm though I have not tested them. It's a script I recently refactored away from but don't want to toss ju" [puppet] - 10https://gerrit.wikimedia.org/r/309366 (owner: 10Chad) [20:14:18] !log demon@tin Synchronized w: Dropping old entry point (duration: 00m 41s) [20:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:37] (03CR) 10Andrew Bogott: [C: 032] Nova: Fix syntax mistake in nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/330996 (owner: 10Andrew Bogott) [20:25:27] (03CR) 1020after4: [C: 031] Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) (owner: 10Thcipriani) [20:25:57] RECOVERY - nova-api http on labnet1002 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.003 second response time [20:33:51] (03PS1) 10Chad: Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 [20:34:09] (03PS2) 10Chad: Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 [20:37:19] https://commons.wikimedia.org/wiki/File:Ka%C5%A1elj_-_Rudnik_via_Orle.webm <- this is why the video scalers were breaking, lol (an example) [20:39:57] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures.
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [20:42:11] 06Operations, 10Traffic: Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801#2924351 (10ema) [20:42:16] 06Operations, 10Traffic: Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801#2924367 (10ema) p:05Triage>03Normal [20:45:27] RECOVERY - nova-api http on labtestnet2001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.083 second response time [20:46:20] (03PS4) 10Tim Landscheidt: mwyaml: Accept existing, but empty "Hiera:" pages as well [puppet] - 10https://gerrit.wikimedia.org/r/325131 (https://phabricator.wikimedia.org/T152142) [20:46:35] (03CR) 1020after4: [C: 031] Deploy scholarships with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: 10Niharika29) [21:03:47] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [21:07:57] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:28:21] (03PS1) 10Andrew Bogott: Shinkengen: Get project hosts from openstack and not from ldap. [puppet] - 10https://gerrit.wikimedia.org/r/331005 (https://phabricator.wikimedia.org/T108625) [21:29:11] (03CR) 10jerkins-bot: [V: 04-1] Shinkengen: Get project hosts from openstack and not from ldap. [puppet] - 10https://gerrit.wikimedia.org/r/331005 (https://phabricator.wikimedia.org/T108625) (owner: 10Andrew Bogott) [22:09:57] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:12:00] (03CR) 10Krinkle: [C: 031] Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 (owner: 10Chad) [22:13:36] (03CR) 10Krinkle: "Perhaps keep MWVersion.php/getMediaWiki as one-line wrapper for at least one commit so that we can do a full Git search for any mentions a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad) [22:15:49] hey, ever tried changing the theme in gerrit diff ? [22:16:18] (03CR) 10Krinkle: "Looks like this is included from files like wiki.phtml using a relative ./ include, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [22:16:26] while looking at a diff in gerrit, upper right corner, gear icon, "Diff preferences" [22:17:03] (03CR) 10Krinkle: "nvm, git search has an outdated index :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [22:19:16] (03CR) 10Krinkle: [C: 031] docroots: Swap wikidata for wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/330709 (owner: 10Chad) [22:20:19] (03PS2) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [22:25:15] (03PS3) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [22:27:49] (03CR) 10Umherirrender: [C: 031] "the only one to change at the moment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328482 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [22:28:09] (03PS4) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [22:30:12] (03PS5) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [22:35:43] Krinkle: You're right 
re: leaving a stub for a bit in MWVersion before deleting. [22:35:48] Don't think I'll rush it out on a friday though [22:36:11] (03CR) 10Jcrespo: [C: 031] "I have not checked the code, but ok with the idea. Please double check authentication after apache changes." [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) (owner: 10Dzahn) [22:36:13] (03CR) 10Chad: [C: 032] Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 (owner: 10Chad) [22:36:50] (03Merged) 10jenkins-bot: Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 (owner: 10Chad) [22:37:52] (03CR) 10jenkins-bot: Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 (owner: 10Chad) [22:38:19] !log demon@tin Synchronized multiversion: updateBranchPointers consolidation (duration: 00m 56s) [22:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:57] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [22:42:16] (03PS6) 10Chad: Remove MWVersion, fold its two functions into MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 [22:42:28] (03PS1) 10Filippo Giunchedi: Revert "swift: move thumbs handling to codfw temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/331084 [22:42:31] (03PS1) 10Dzahn: ganglia: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/331085 (https://phabricator.wikimedia.org/T133717) [22:43:08] (03PS7) 10Paladox: Gerrit: Add support for logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [22:43:16] (03PS8) 10Paladox: Gerrit: Add support for logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) 
[22:43:27] (03CR) 10Paladox: "> maybe change the commit message. now it says "enable" but doesn't" [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [22:43:47] (03CR) 10Chad: "PS6 keeps MWVersion as a light wrapper for now so we can do a last check :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad) [22:44:09] (03CR) 10Paladox: "Ok. :)" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [22:46:07] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2924787 (10Paladox) @jcrespo or @Marostegui Hi, i have tested this https://gerrit.wikimedia.org/r/#/c/330455/ locally. Could you... [22:54:04] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2924820 (10Deskana) [22:54:06] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2924819 (10Deskana) 05Open>03Resolved [22:56:45] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2344551 (10Paladox) Was the elastic search plugin updated on the install? Might have been because... 
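The Gerrit unicode bug referenced above (T145885, "Gerrit shows HTTP 500 error when pasting extended unicode characters") is tagged for the DBAs, which suggests the classic MySQL "utf8" limitation: characters outside the Basic Multilingual Plane need four bytes in UTF-8, while MySQL's legacy 3-byte "utf8" charset cannot store them unless the column uses utf8mb4. That diagnosis is an inference from the task tags, not stated in this log; the encoding sizes themselves can be checked directly:

```python
# UTF-8 byte lengths per character. MySQL's legacy "utf8" charset
# stores at most 3 bytes per character, so 4-byte characters such
# as emoji are rejected unless the schema uses utf8mb4 instead.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):  # a, é, €, 😀
    print(repr(ch), len(ch.encode("utf-8")))
# the emoji (U+1F600) encodes to 4 bytes; the others to 1, 2, 3
```

Any input containing a 4-byte character would thus fail on a 3-byte "utf8" column, which matches an HTTP 500 on paste.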
[22:57:36] (03PS6) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) [22:57:43] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:06:13] (03CR) 10BBlack: [C: 031] Revert "swift: move thumbs handling to codfw temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/331084 (owner: 10Filippo Giunchedi) [23:07:10] (03CR) 10Filippo Giunchedi: [C: 032] Revert "swift: move thumbs handling to codfw temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/331084 (owner: 10Filippo Giunchedi) [23:08:11] !log force puppet run on cache_upload in eqiad to switch thumbs back from codfw [23:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:37] Hi guys, quick question. [23:23:07] And ideas as to who's responsible for handling domain transfers on the WMF side? [23:23:32] Also who's responsible for handling that on MarkMonitor side? Is it still Doni Daggett? [23:24:48] (03CR) 10Hashar: "Note that the experimental job runs on a Trusty machine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:25:30] mutante: any ideas? [23:25:44] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:26:13] odder: best bet is to fill a task for #operations and #dns ? :) [23:26:39] (03CR) 10Krinkle: "It's on nodepool/jessie now. submodules are expanded. vendor non-dev is preserved. Uses same fetch-composer-dev logic as we do for mediawi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:27:37] hashar: sure, just managed to get it from a squatter so \o/! victory! 
[23:29:21] 06Operations, 10DNS, 10Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2924998 (10tomasz) [23:29:23] (03CR) 10Hashar: [C: 031] "That is quite nice! Well done Timo :]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:29:26] https://phabricator.wikimedia.org/T154826 hashar [23:29:59] 06Operations, 10DNS, 10Domains, 10Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2925024 (10hashar) [23:30:00] odder: I cant remember all the details [23:30:38] odder: but I think WMF avoid buying every possible domains because it has a cost (the registration fee) and is a lot of operational work (dns entries, web park/redirect) etc [23:30:44] It used to be Yana / Doni Daggett from MarkMonitor when I last bought & donated a domain to the WMF. [23:30:59] hashar: it's a TLD that got squatted in 200x. [23:31:21] I have been meaning to buy for years now; they only forgot to renew it this year. [23:31:24] I am just mumbling hearsays :D [23:31:50] ops or legal would be able to tell eventually [23:32:10] Well I did pay for it for the first year so that's done :) [23:32:23] \O/ [23:32:49] (03PS1) 10Krinkle: noc: Implement noc.wikimedia.org/db.php?format=json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331091 [23:34:06] odder: there is wikipedia.pl as well registered to Stowarzyszenie Wikimedia Polska. Maybe that is the local chapter [23:35:15] They are the local chapter, however they do not wish to buy domain names for WMF trademarks which are not the Polish names of their respective projects. [23:35:39] So Wiktionary is called "Wikisłownik" in Polish and they didn't want to buy the English trademark as it's not a Polish name of the project. [23:36:04] Same with Wikiquote.pl which I also snatched from a squatter and donated to WMF in 2015 (?) I think. 
[23:36:16] that kind of makes sense yeah [23:36:29] We're almost there, just one squatted domain to go. [23:36:44] I've been on a mission to catch them all for a while :) [23:37:02] when you get the last polish domain [23:37:14] I guess we should have a party :D [23:37:20] Expires 2017-04-23, but they're damn professionals, so we'll see. [23:37:34] Or maybe I'll just punch them in the face when I walk past their offices, we'll see. [23:37:53] they've held that one since 2008 :-( [23:38:08] well they will disappear long before Wikimedia does :] [23:38:13] time is on our side! [23:38:31] on those good words, I am heading to bed. Have a good week-end! [23:38:40] You too; thanks! [23:38:46] (03CR) 10Jcrespo: "I had some of this already implemented; in particular the json, which unlike PHP, has not item order for hashes, so the master requires ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331091 (owner: 10Krinkle) [23:39:11] (03PS2) 10Krinkle: noc: Implement noc.wikimedia.org/db.php?format=json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331091 [23:40:45] 06Operations, 10DNS, 10Domains, 10Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2924998 (10CRoslof) Send me an email and I can help you get the process started: croslof@wikimedia.org Please note that a domain generally cannot be transferred until 60 days after...
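Jcrespo's review comment above touches a real pitfall for the db.php JSON output: unlike PHP associative arrays, JSON objects carry no defined member order, so two serializations with differently ordered keys parse to equal values and a consumer must not rely on key position. A minimal Python illustration (the shard/host names here are made up, not the real db.php data):

```python
import json

# Two serializations of the same object, members in different order.
a = json.loads('{"s1": "db1052", "s2": "db1018"}')
b = json.loads('{"s2": "db1018", "s1": "db1052"}')

# JSON objects are unordered: both parse to equal mappings, so any
# consumer that cares about order (e.g. "the master is first") needs
# an explicit field or a list instead of relying on key position.
print(a == b)
```

This is why, as the comment hints, the master would need to be marked explicitly rather than inferred from its position in the hash.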
[23:41:09] 06Operations, 10DNS, 10Domains, 10Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2925039 (10CRoslof) a:03CRoslof [23:45:00] (03PS3) 10Krinkle: noc: Implement noc.wikimedia.org/db.php?format=json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331091 [23:45:57] (03PS7) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) [23:48:27] (03PS1) 10Krinkle: build: Update PHPUnit from 3.7 to 4.8, add phplint to composer-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) [23:48:36] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:49:13] (03CR) 10jerkins-bot: [V: 04-1] build: Update PHPUnit from 3.7 to 4.8, add phplint to composer-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:49:17] (03CR) 10jerkins-bot: [V: 04-1] build: Update PHPUnit from 3.7 to 4.8, add phplint to composer-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:57:27] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues