[01:30:49] <icinga-wm>	 PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100%
[01:31:49] <godog>	 heka is fr IIRC
[01:31:59] <icinga-wm>	 PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:09] <icinga-wm>	 PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:12] <icinga-wm>	 PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:15] <godog>	 I smell codfw pfw bug
[01:32:59] <icinga-wm>	 PROBLEM - Host fdb2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:33:09] <icinga-wm>	 PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:33:19] <icinga-wm>	 PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100%
[01:33:29] <icinga-wm>	 PROBLEM - Router interfaces on pfw-codfw is CRITICAL: CRITICAL: host 208.80.153.195, interfaces up: 55, down: 3, dormant: 0, excluded: 0, unused: 0; vlan.2133: down - Subnet frack-bastion-codfw; vlan.2140: down - Subnet frack-management-codfw; vlan.2137: down - Subnet frack-listenerdmz-codfw
[01:33:39] <icinga-wm>	 PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100%
[01:33:41] <godog>	 no way of silencing it in batch, we'll get the pages
[01:35:26] <godog>	 this is T126790 btw
[01:37:03] <godog>	 well, I think so far, checking
[01:39:39] <icinga-wm>	 PROBLEM - Juniper alarms on pfw-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.195
[01:40:19] <icinga-wm>	 RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms
[01:40:22] <icinga-wm>	 RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 36.42 ms
[01:40:24] <icinga-wm>	 RECOVERY - Host fdb2001 is UP: PING OK - Packet loss = 0%, RTA = 36.49 ms
[01:40:27] <icinga-wm>	 RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms
[01:40:30] <icinga-wm>	 RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms
[01:40:32] <icinga-wm>	 RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 36.53 ms
[01:40:35] <icinga-wm>	 RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms
[01:40:38] <icinga-wm>	 RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 36.42 ms
[01:40:51] <icinga-wm>	 RECOVERY - Juniper alarms on pfw-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[01:40:51] <icinga-wm>	 RECOVERY - Router interfaces on pfw-codfw is OK: OK: host 208.80.153.195, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0
[01:50:11] <icinga-wm>	 PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 4 failures
[01:50:11] <icinga-wm>	 PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 9 failures
[01:55:11] <icinga-wm>	 RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[01:55:11] <icinga-wm>	 RECOVERY - check_puppetrun on fdb2001 is OK: OK: Puppet is currently enabled, last run 201 seconds ago with 0 failures
[02:12:12] <wikibugs_>	 Operations, Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T151602#2822419 (Sleepinglion)
[02:18:20] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 06m 48s)
[02:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:40] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Nov 25 02:22:40 UTC 2016 (duration 4m 20s)
[02:22:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 669.09 seconds
[03:29:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 228.18 seconds
[04:18:41] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3492.60 Read Requests/Sec=6262.00 Write Requests/Sec=7.00 KBytes Read/Sec=25094.80 KBytes_Written/Sec=3245.20
[04:25:41] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=205.30 Read Requests/Sec=317.40 Write Requests/Sec=1.00 KBytes Read/Sec=3244.00 KBytes_Written/Sec=326.40
[05:24:51] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 23 failures. Last run 2 minutes ago with 23 failures. Failed resources (up to 3 shown): Package[wipe],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service],Package[zsh-beta]
[05:47:21] <icinga-wm>	 PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:55:51] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:15:21] <icinga-wm>	 RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:23:51] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[07:15:15] <grrrit-wm>	 (PS1) Marostegui: site.pp: db1044's binlog changed to ROW [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802)
[07:19:03] <grrrit-wm>	 (CR) Marostegui: "https://puppet-compiler.wmflabs.org/4672/" [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802) (owner: Marostegui)
[07:22:08] <grrrit-wm>	 (PS1) Marostegui: db-eqiad.php: Added comment for db1044 [mediawiki-config] - https://gerrit.wikimedia.org/r/323505 (https://phabricator.wikimedia.org/T150802)
[07:26:53] <wikibugs_>	 Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2822534 (Joe) @aaron why can't we just do something smarter and have a global throttling on a specific job type don...
[07:40:51] <grrrit-wm>	 (CR) Jcrespo: [C: 1] site.pp: db1044's binlog changed to ROW [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802) (owner: Marostegui)
[07:41:48] <grrrit-wm>	 (CR) Jcrespo: [C: 1] db-eqiad.php: Added comment for db1044 [mediawiki-config] - https://gerrit.wikimedia.org/r/323505 (https://phabricator.wikimedia.org/T150802) (owner: Marostegui)
[07:43:42] <grrrit-wm>	 (CR) Jcrespo: site.pp: db1044's binlog changed to ROW (1 comment) [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802) (owner: Marostegui)
[07:45:07] <grrrit-wm>	 (PS2) Marostegui: site.pp: Change db1044 binlog to ROW [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802)
[07:51:45] <marostegui>	 !log Stopping replication db1052 for maintenance - T151607
[07:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:57] <stashbot>	 T151607: Rebuild old timestamp format tables - https://phabricator.wikimedia.org/T151607
[08:06:00] <grrrit-wm>	 (CR) Nikerabbit: [C: -1] Explicitly set cookieDomain for ContentTranslationSiteTemplates (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) (owner: KartikMistry)
[08:14:17] <botwanker>	 jynus: Your mother is available, for sex?
[08:17:24] <botwanker>	 _joe_: Your mother is available, for sex?
[08:18:00] <_joe_>	 botwanker: she has black friday offers up, yes, you should run. I'm sure that would make you feel important too
[08:18:30] <botwanker>	 @_joe_: Your mother is available, for fucking in the ass?
[08:18:54] <botwanker>	 but in a more serious topic
[08:19:07] <botwanker>	 I am fucking up all the wikimedia shit
[08:19:09] <botwanker>	 all the chans
[08:19:22] <botwanker>	 all of them, none shall be spared the wrath of me!
[08:19:48] * _joe_ shrugs
[08:20:11] <botwanker>	 hashar: Your mother is available, for sex?
[08:34:15] <elukey>	 sigh
[08:34:41] <elukey>	 Why don't people have anything better to do than these things
[08:35:49] <hashar>	 bored / having fun
[08:35:54] <hashar>	 welcome to the internet!
[08:37:06] <_joe_>	 p858snake|L2: there is a clear reason why I didn't ban him 
[08:37:20] <_joe_>	 it fuels his sense of self-importance
[08:37:35] <_joe_>	 I was kinda managing the situation
[08:38:57] <valhallasw`cloud>	 _joe_: sorry, only just saw your comment here now. I disagree with your assessment, but I should have discussed it here beforehand.
[08:47:03] <_joe_>	 valhallasw`cloud: it's ok :)
[08:52:16] <elukey>	 !log restarting Yarn and HDFS masters on analytics100[12] (Hadoop cluster) to complete the openjdk update
[08:52:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:53] <volans>	 !log restarting zotero on sca1003, almost out of RAM, puppet failing
[08:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:05] <_joe_>	 !log uploading hhvm_3.12.7+dfsg-1+wmf4 to apt
[08:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:51] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[09:18:04] <wikibugs_>	 Operations, MediaWiki-Cache, Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2822732 (hashar)
[09:19:40] <wikibugs>	 Operations, MediaWiki-Cache, Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2822737 (hashar) I guess #traffic is sufficient. I have filed this task as potential material to look at high rate of cache purges which apparently is or will be an issue. N...
[09:23:28] <_joe_>	 !log upgraded hhvm on the debug hosts
[09:23:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:57] <wikibugs>	 Operations, MediaWiki-Cache, Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2822747 (hashar) Looks like the CdnPurgeJob are intentionally NOT deduplicated! ``` name=includes/jobqueue/jobs/CdnPurgeJob.php, lang=php class CdnPurgeJob extends Job {...
[09:33:01] <wikibugs>	 Operations, Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2822762 (elukey) I checked some hosts showing the same behavior as cp1055 and the type of request that causes the assert failure is always the same:  ```  /w/api.php?action=quer...
[09:44:28] <wikibugs_>	 Operations, Upstream: Trusty: debug information found in "/usr/lib/debug//usr/lib/php5/20121212/mysql.so" does not match "/usr/lib/php5/20121212/mysql.so" (CRC mismatch). - https://phabricator.wikimedia.org/T145706#2822771 (hashar) Open>declined Step to reproduce: ``` $ gdb /usr/lib/php5/20121212...
[09:46:45] <wikibugs>	 Operations: Puppet CA rollover - https://phabricator.wikimedia.org/T150823#2797837 (Volans) And also of T111654 for MariaDB's usage of this CA. Adding #dba
[09:56:10] <wikibugs_>	 Operations: Puppet CA rollover - https://phabricator.wikimedia.org/T150823#2797837 (Joe) There are other systems using this CA, namely etcd.  I would suggest the right course of action for all apps is:  # create a new CA key/cert # Install it as a trusted CA on all live systems # Audit how applications use t...
[10:10:23] <wikibugs>	 Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2822790 (elukey) Open>stalled
[10:10:35] <wikibugs_>	 Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2288419 (elukey) This task is currently blocked by T137345
[10:13:51] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.14 seconds
[10:14:14] <marostegui>	 ^ checking
[10:14:39] <jynus>	 there are no long-running queries
[10:15:26] <jynus>	 marostegui, db1047 is an analytics, non-core server
[10:15:56] <marostegui>	 ah fine
[10:16:04] <jynus>	 IO reads have skyrocketed
[10:16:05] <elukey>	 bad analytics people causing trouble
[10:16:10] <marostegui>	 XD
[10:19:51] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 0.11 seconds
[10:20:14] <_joe_>	 !log upgrading HHVM across codfw
[10:20:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:29] <wikibugs_>	 Operations, Ops-Access-Requests, Discovery, Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2822802 (Gehel) Resolved>Open >>! In T151063#2811510, @Ottomata wrote: > Users in the `stats` gro...
[10:34:25] <mafk>	 Hello. Can someone run a maintenance script (dry-run) on angwiki for me?
[10:40:40] <jynus>	 mafk, I do not think running such a script is ok during the deployment freeze
[10:41:04] <jynus>	 if anything goes bad, nobody is here that will be able to help
[10:41:13] <mafk>	 jynus: dry-run is not the title of the script
[10:41:31] <mafk>	 it's for checking bad redirects in report-only mode, no changes made
[10:41:40] <mafk>	 but if it can't be done, then ok
[10:42:17] <grrrit-wm>	 (PS1) Alexandros Kosiaris: Add network default value to Grafana Server Board [puppet] - https://gerrit.wikimedia.org/r/323513
[10:42:29] <jynus>	 mafk, precisely, the right people to help are not here
[10:42:47] <mafk>	 jynus: I understand
[10:43:16] <mafk>	 I'll note it on Phab for next week
[10:44:26] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] Add network default value to Grafana Server Board [puppet] - https://gerrit.wikimedia.org/r/323513 (owner: Alexandros Kosiaris)
[10:50:39] <grrrit-wm>	 (PS1) Alexandros Kosiaris: Remove some more garbage from Grafana Server Board [puppet] - https://gerrit.wikimedia.org/r/323515
[10:50:48] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2 V: 2] Remove some more garbage from Grafana Server Board [puppet] - https://gerrit.wikimedia.org/r/323515 (owner: Alexandros Kosiaris)
[11:04:53] <grrrit-wm>	 (PS1) Elukey: Avoid Redis IPsec replication if the host doesn't need it. [puppet] - https://gerrit.wikimedia.org/r/323517 (https://phabricator.wikimedia.org/T137345)
[11:05:19] <ema>	 !log uploaded libvmod-{netmapper,tbf,vslp} to carbon main component (T150660)
[11:05:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:30] <stashbot>	 T150660: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660
[11:05:51] <icinga-wm>	 PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:15:13] <wikibugs_>	 Operations, Performance-Team, Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2822968 (Gilles)
[11:16:21] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[11:17:21] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3869934 keys, up 25 days 2 hours - replication_delay is 0
[11:22:17] <grrrit-wm>	 (PS2) Elukey: Avoid Redis IPsec replication if the host doesn't need it. [puppet] - https://gerrit.wikimedia.org/r/323517 (https://phabricator.wikimedia.org/T137345)
[11:33:51] <icinga-wm>	 RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[11:50:13] <wikibugs_>	 Operations, Performance-Team, Thumbor: Investigate source of thumbnail 302 redirects - https://phabricator.wikimedia.org/T148410#2822991 (Gilles) @fgiunchedi do I have access to any of the varnish production frontends?  I'd like to run something like this: `varnishlog -q "RespStatus eq 302" -i ReqHea...
[11:56:18] <wikibugs_>	 Operations, ops-eqiad, DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2823011 (Marostegui) p:Triage>Normal
[11:57:01] <wikibugs>	 Operations, ops-codfw, ops-ulsfo: cp4008 power supply failure - https://phabricator.wikimedia.org/T151275#2823013 (Volans) p:Triage>Normal
[11:57:53] <wikibugs_>	 Operations, DBA, Patch-For-Review: db1092 crash - https://phabricator.wikimedia.org/T151272#2823014 (Marostegui) p:Triage>Normal
[11:59:25] <wikibugs>	 Operations, Documentation: Proper documentation for Yubico 2FA for production use - https://phabricator.wikimedia.org/T151050#2823015 (Volans) p:Triage>Low
[11:59:52] <wikibugs_>	 Operations: Run systematic availability tests - https://phabricator.wikimedia.org/T151049#2823016 (Volans) p:Triage>Normal
[12:00:07] <wikibugs>	 Operations: Icinga monitoring for Yubikey components - https://phabricator.wikimedia.org/T151048#2823044 (Volans) p:Triage>Normal
[12:00:26] <wikibugs_>	 Operations: Integrate Yubikey into data.yaml - https://phabricator.wikimedia.org/T151047#2823045 (Volans) p:Triage>Normal
[12:00:50] <wikibugs>	 Operations: Fully puppetise yubikey-val - https://phabricator.wikimedia.org/T151046#2823046 (Volans) p:Triage>Normal
[12:01:04] <wikibugs_>	 Operations: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045#2823047 (Volans) p:Triage>Normal
[12:02:06] <wikibugs>	 Operations: Internal PKI for secure communication - Barcelona Ops offsite 2016 - https://phabricator.wikimedia.org/T150822#2823048 (Volans) p:Triage>Normal
[12:04:11] <icinga-wm>	 PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:15:02] <wikibugs_>	 Operations, ops-codfw, fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2758792 (Volans) This host was left in scheduled downtime and it was scheduled for a long downtime. I've removed the downtime to the host and all services.
[12:17:25] <wikibugs>	 Operations, Icinga, Labs, Labs-Infrastructure, Monitoring: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#2823158 (Volans) Resolved>Open `labtestcontrol2001` host and many of its services are in scheduled downtime on Icinga. If this was...
[12:23:17] <wikibugs>	 Operations, Icinga, Labs, Labs-Infrastructure, Monitoring: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#2823179 (Volans) Similar situation of scheduled downtime for the host and some of the services also for: - `labtestnet2001`: all ok - `labte...
[12:32:11] <icinga-wm>	 RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[12:44:00] <wikibugs>	 Operations: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#2823217 (Volans)
[12:54:47] <grrrit-wm>	 (PS1) Jcrespo: [WIP] Create script to check that sanitarium filtering is working [puppet] - https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802)
[13:02:27] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] Create script to check that sanitarium filtering is working [puppet] - https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: Jcrespo)
[13:27:00] <wikibugs>	 Operations, Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2823275 (elukey) p:Triage>Normal
[13:31:38] <wikibugs_>	 Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2610832 (Reedy) Ok, so what username are you using on wikitech? Is it Shoichi too? I notice that isn't linked to your Wikitech account, but is to your normal wiki account...
[13:40:27] <grrrit-wm>	 (PS1) Elukey: Remove cron notifications to root@ for jobchron/runner service status [puppet] - https://gerrit.wikimedia.org/r/323528 (https://phabricator.wikimedia.org/T132324)
[13:54:54] <wikibugs>	 Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823313 (Shoichi) >>! In T144805#2823276, @Reedy wrote: > Ok, so what username are you using on wikitech? Is it Shoichi too? I notice that isn't linked to your Wikitech account, but is to your n...
[14:16:18] <grrrit-wm>	 (PS1) Ema: varnishapi.py: add VSM_Close C binding [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561)
[14:16:20] <Reedy>	 !log delete oathauth row on wikitech for user Shoichi per T144805
[14:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:31] <stashbot>	 T144805: Can't login wikitech - https://phabricator.wikimedia.org/T144805
[14:19:17] <wikibugs_>	 Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823372 (Reedy) This has now been done :)
[14:20:11] <icinga-wm>	 PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:20:37] <wikibugs>	 Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823375 (Liuxinyu970226) >>! In T144805#2823372, @Reedy wrote: > This has now been done :)  me too...
[14:22:17] <Reedy>	 !log delete oathauth row on wikitech for user Liuxinyu970226 per T144805
[14:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:29] <stashbot>	 T144805: Can't login wikitech - https://phabricator.wikimedia.org/T144805
[14:28:59] <wikibugs_>	 Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823403 (Reedy) Open>Resolved a:Reedy
[14:48:45] <grrrit-wm>	 (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/323528 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
[14:48:49] <grrrit-wm>	 (PS3) Ema: varnish: rename scripts depending on varnishlog.py [puppet] - https://gerrit.wikimedia.org/r/323423 (https://phabricator.wikimedia.org/T150660)
[14:48:56] <wikibugs>	 Operations, netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2823419 (faidon) This has been opened with JTAC as case [[ https://casemanager.juniper.net/casemanager/#/cmdetails/2016-1125-0413 | 2016-1125-0413 ]].
[14:48:59] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] varnish: rename scripts depending on varnishlog.py [puppet] - https://gerrit.wikimedia.org/r/323423 (https://phabricator.wikimedia.org/T150660) (owner: Ema)
[14:49:11] <icinga-wm>	 RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[14:55:30] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 2] "LGTM but please test it after deploying it, too critical to fail." [puppet] - https://gerrit.wikimedia.org/r/322362 (owner: Volans)
[14:58:07] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 1] add additional information on malformed responses [software/service-checker] - https://gerrit.wikimedia.org/r/321714 (https://phabricator.wikimedia.org/T150560) (owner: Volans)
[14:58:36] <FlorianSW>	 Amir1: ping
[15:03:04] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 1] "This is nice work!" [puppet] - https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043) (owner: Volans)
[15:09:42] <wikibugs_>	 Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2823445 (matmarex) @joe's runJobs.php run has brought the queue back to normal levels, but it is growing again at a...
[15:11:06] <wikibugs_>	 Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2823448 (Joe) @matmarex agreed, I was waiting for about one day of data to accrue, but it's clearly a good idea to...
[15:18:19] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 2] Remove cron notifications to root@ for jobchron/runner service status [puppet] - https://gerrit.wikimedia.org/r/323528 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
[15:18:23] <grrrit-wm>	 (PS2) Faidon Liambotis: Remove cron notifications to root@ for jobchron/runner service status [puppet] - https://gerrit.wikimedia.org/r/323528 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
[15:21:01] <icinga-wm>	 PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:21:44] <grrrit-wm>	 (CR) Elukey: [C: 1] varnishapi.py: add VSM_Close C binding [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561) (owner: Ema)
[15:23:02] <grrrit-wm>	 (PS2) Ema: varnishapi.py: add VSM_Close C binding [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561)
[15:23:05] <grrrit-wm>	 (CR) Elukey: "Maybe we could wait for an official release from upstream to bump the varnishapi.py version number" [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561) (owner: Ema)
[15:23:11] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] varnishapi.py: add VSM_Close C binding [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561) (owner: Ema)
[15:23:12] <Amir1>	 FlorianSW: I'm just back
[15:23:14] <Amir1>	 what's up
[15:24:02] <FlorianSW>	 Amir1: hi :) I already wrote my question in phabricator: It seems you're not registered in the Google code-in platform, right?
[15:24:25] <Amir1>	 Yes, Can I talk to you privately?
[15:25:45] <FlorianSW>	 sure :)
[15:33:49] <andre__>	 FlorianSW, Amir1: Just FYI I'm also around (and still need to reply to Amir1's email, sigh...)
[15:33:56] <andre__>	 (and probably wrong channel here)
[15:35:17] <Amir1>	 o/
[15:43:11] <grrrit-wm>	 (PS1) Hashar: Shinken wmflabs: remove Chris McMahon [puppet] - https://gerrit.wikimedia.org/r/323536
[15:43:23] <wikibugs>	 Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823470 (zhuyifei1999) Um.. @shizhao?
[15:44:58] <grrrit-wm>	 (PS1) Hashar: Remove Antoine from beta cluster notifications [puppet] - https://gerrit.wikimedia.org/r/323537
[15:49:01] <icinga-wm>	 RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[15:51:05] <grrrit-wm>	 (CR) Aude: [C: 1] Use entity types for the repoNamespaces Wikibase client setting [mediawiki-config] - https://gerrit.wikimedia.org/r/323347 (owner: Hoo man)
[15:52:22] <wikibugs_>	 Operations, Traffic: Varnishkafka seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2823500 (elukey)
[15:58:22] <wikibugs_>	 Operations, Traffic: varnishlog daemons seeing Log overrun constantly - https://phabricator.wikimedia.org/T151643#2823504 (elukey)
[16:08:22] <grrrit-wm>	 (CR) Alex Monk: "I think you forgot to delete the old file" [puppet] - https://gerrit.wikimedia.org/r/319384 (owner: Rush)
[16:09:21] <wikibugs_>	 Operations, Traffic: varnishlog daemons seeing Log overrun constantly - https://phabricator.wikimedia.org/T151643#2823522 (elukey) p:Triage>High
[16:19:33] <wikibugs_>	 Operations, Traffic: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2823540 (elukey)
[16:35:23] <grrrit-wm>	 (PS1) Chad: gerrit (2.13.3-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/323545 (https://phabricator.wikimedia.org/T146350)
[16:47:41] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:48:41] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[16:53:41] <grrrit-wm>	 (PS4) Volans: RAID: get RAID status improvement for MegaCLI [puppet] - https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043)
[17:15:00] <jynus>	 !log drop database vewikimedia (deleted wiki) from sanitarium and its slaves
[17:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:04] <grrrit-wm>	 (CR) Paladox: [C: 1] gerrit (2.13.3-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/323545 (https://phabricator.wikimedia.org/T146350) (owner: Chad)
[17:23:48] <grrrit-wm>	 (PS1) Mobrovac: RESTBase: Add the PDF Render service config [puppet] - https://gerrit.wikimedia.org/r/323548
[17:28:44] <grrrit-wm>	 (PS2) Mobrovac: RESTBase: Add the PDF Render service config [puppet] - https://gerrit.wikimedia.org/r/323548
[17:32:21] <icinga-wm>	 PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:32:31] <icinga-wm>	 PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:33:43] <Zppix>	 are the tools-labs servers operating okay? It appears one of them may need a restart or something.
[17:34:34] <Krenair>	 Zppix, which one?
[17:34:54] <Zppix>	 Krenair the one that earwig's copyvio tool is running on
[17:35:07] <Krenair>	 got a link?
[17:35:12] <Zppix>	 to ?
[17:35:18] <Krenair>	 the tool
[17:35:22] <Zppix>	 https://tools.wmflabs.org/copyvios
[17:36:00] <Zppix>	 208.80.155.131:80 Krenair if that helps
[17:36:17] <Krenair>	 it doesn't really
[17:36:30] <Krenair>	 so what's wrong with the tool?
[17:36:32] <Zppix>	 ok... i wish i could tell you the exact server(s)
[17:37:11] <icinga-wm>	 RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:37:17] <Krenair>	 I know which server it's running on
[17:37:21] <icinga-wm>	 RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[17:38:24] <grrrit-wm>	 (PS2) Jcrespo: [WIP] Create script to check that sanitarium filtering is working [puppet] - https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802)
[17:38:26] <Krenair>	 but I don't see anything wrong with the tool Zppix
[17:39:05] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] Create script to check that sanitarium filtering is working [puppet] - https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: Jcrespo)
[17:39:05] <Zppix>	 Krenair it wasn't loading at its normal rate earlier
[17:39:22] <Zppix>	 I'm just alerting someone so it doesn't come back to bite someone in the ass later (pardon my french)
[17:40:08] <Krenair>	 well from what I can tell the server it's running on is fine
[17:40:34] <Zppix>	 Krenair okay. I just wanted to let someone know in case it was a server issue.
[17:40:43] <Krenair>	 okay
[17:40:56] <Zppix>	 If it does it again I will let you know
[17:41:46] <Krenair>	 the proxy in front of tools looks okay too
[17:41:56] <Krenair>	 ok
[17:45:31] <Zppix>	 Krenair for future reference what is the server name, so I don't need to have someone go digging.
[17:45:51] <Krenair>	 it's on the tools grid, so it can change
[17:46:08] <Zppix>	 ah, ok
[17:46:21] <Zppix>	 surprised it's not kubectl
[17:52:53] <Krenair>	 I'm not really
[17:53:16] <Krenair>	 I don't know if we have stats on the use of each system but the grid is still very much in active use
[17:54:57] <Zppix>	 everything on my tool is on kubectl (my webservice and my bot)
[18:02:55] <wikibugs>	 Operations, Gerrit, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2823628 (Paladox) I found this https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
[18:07:22] <wikibugs_>	 Operations, media-storage: Consider storage policies for swift - https://phabricator.wikimedia.org/T151648#2823630 (fgiunchedi)
[18:13:44] <grrrit-wm>	 (CR) Ppchelko: [C: 1] RESTBase: Add the PDF Render service config [puppet] - https://gerrit.wikimedia.org/r/323548 (owner: Mobrovac)
[18:16:11] <icinga-wm>	 PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:43:11] <icinga-wm>	 RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[18:51:01] <icinga-wm>	 PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:18:01] <icinga-wm>	 RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[19:26:02] <grrrit-wm>	 (CR) Aklapper: "This can be ABANDONED as T151148 has been resolved." [puppet] - https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148) (owner: 20after4)
[19:43:31] <icinga-wm>	 PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:44:42] <wikibugs>	 Operations, Traffic, Patch-For-Review, Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2823701 (fgiunchedi) The error should have been fixed upstream by https://github.com/jonnenauha/prometheus_varnish_ex...
[20:09:34] <Krinkle>	 !log mwscript deleteEqualMessages.php --wiki angwiki (T45917)
[20:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:45] <stashbot>	 T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[20:12:31] <icinga-wm>	 RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[20:20:12] <grrrit-wm>	 (PS1) Urbanecm: [throttle] Exception for #MOWomenOnWikipedia Edit-A-Thon [mediawiki-config] - https://gerrit.wikimedia.org/r/323555 (https://phabricator.wikimedia.org/T151650)
[20:40:29] <grrrit-wm>	 (PS1) Aude: Move interwiki sorting orders to config [mediawiki-config] - https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023)
[20:43:34] <grrrit-wm>	 (PS2) Urbanecm: [throttle] Exception for #MOWomenOnWikipedia Edit-A-Thon [mediawiki-config] - https://gerrit.wikimedia.org/r/323555 (https://phabricator.wikimedia.org/T151650)
[20:45:33] <grrrit-wm>	 (PS1) Urbanecm: [throttle] Remove old throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/323557
[21:23:51] <icinga-wm>	 PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:25:28] <grrrit-wm>	 (PS1) Filippo Giunchedi: prometheus: add vhtcpd stats via node-exporter [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429)
[21:52:51] <icinga-wm>	 RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[22:00:31] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[22:01:21] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3915068 keys, up 25 days 13 hours - replication_delay is 0
[22:09:11] <icinga-wm>	 PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:13:07] <grrrit-wm>	 (PS10) Paladox: Add gbp.conf file for debian [debs/gerrit] - https://gerrit.wikimedia.org/r/301841
[22:14:16] <grrrit-wm>	 (PS11) Chad: Add gbp.conf file for debian [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 (owner: Paladox)
[22:14:52] <grrrit-wm>	 (CR) Paladox: "recheck" [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 (owner: Paladox)
[22:15:01] <grrrit-wm>	 (CR) Chad: [C: 2] Add gbp.conf file for debian [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 (owner: Paladox)
[22:15:05] <grrrit-wm>	 (CR) Chad: [V: 2] Add gbp.conf file for debian [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 (owner: Paladox)
[22:16:29] <grrrit-wm>	 (PS2) Chad: gerrit (2.13.3-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/323545 (https://phabricator.wikimedia.org/T146350)
[22:38:11] <icinga-wm>	 RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[22:59:01] <grrrit-wm>	 (CR) Volans: "A couple of minor inline comments" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429) (owner: Filippo Giunchedi)
[23:01:21] <icinga-wm>	 PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:10:51] <icinga-wm>	 PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:28:33] <grrrit-wm>	 (PS2) Filippo Giunchedi: prometheus: add vhtcpd stats via node-exporter [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429)
[23:29:21] <icinga-wm>	 RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[23:31:42] <grrrit-wm>	 (Abandoned) 20after4: Allow aklapper to `sudo -E` phabricator admin utilities [puppet] - https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148) (owner: 20after4)
[23:38:51] <icinga-wm>	 RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures