[01:30:49] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100%
[01:31:49] heka is fr IIRC
[01:31:59] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:09] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:12] PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:15] I smell codfw pfw bug
[01:32:59] PROBLEM - Host fdb2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:33:09] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:33:19] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100%
[01:33:29] PROBLEM - Router interfaces on pfw-codfw is CRITICAL: CRITICAL: host 208.80.153.195, interfaces up: 55, down: 3, dormant: 0, excluded: 0, unused: 0; vlan.2133: down - Subnet frack-bastion-codfw; vlan.2140: down - Subnet frack-management-codfw; vlan.2137: down - Subnet frack-listenerdmz-codfw
[01:33:39] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100%
[01:33:41] no way of silencing it in batch, we'll get the pages
[01:35:26] this is T126790 btw
[01:37:03] well, I think so far, checking
[01:39:39] PROBLEM - Juniper alarms on pfw-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.195
[01:40:19] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms
[01:40:22] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 36.42 ms
[01:40:24] RECOVERY - Host fdb2001 is UP: PING OK - Packet loss = 0%, RTA = 36.49 ms
[01:40:27] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms
[01:40:30] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms
[01:40:32] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 36.53 ms
[01:40:35] RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms
[01:40:38] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 36.42 ms
[01:40:51] RECOVERY - Juniper alarms on pfw-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[01:40:51] RECOVERY - Router interfaces on pfw-codfw is OK: OK: host 208.80.153.195, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0
[01:50:11] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 4 failures
[01:50:11] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 9 failures
[01:55:11] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[01:55:11] RECOVERY - check_puppetrun on fdb2001 is OK: OK: Puppet is currently enabled, last run 201 seconds ago with 0 failures
[02:12:12] Operations, Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T151602#2822419 (Sleepinglion)
[02:18:20] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 06m 48s)
[02:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Nov 25 02:22:40 UTC 2016 (duration 4m 20s)
[02:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:11] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 669.09 seconds
[03:29:11] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 228.18 seconds
[04:18:41] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3492.60 Read Requests/Sec=6262.00 Write Requests/Sec=7.00 KBytes Read/Sec=25094.80 KBytes_Written/Sec=3245.20
[04:25:41] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=205.30 Read Requests/Sec=317.40 Write Requests/Sec=1.00 KBytes Read/Sec=3244.00 KBytes_Written/Sec=326.40
[05:24:51] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 23 failures. Last run 2 minutes ago with 23 failures. Failed resources (up to 3 shown): Package[wipe],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service],Package[zsh-beta]
[05:47:21] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:55:51] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:15:21] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:23:51] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[07:15:15] (PS1) Marostegui: site.pp: db1044's binlog changed to ROW [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802)
[07:19:03] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/4672/" [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802) (owner: Marostegui)
[07:22:08] (PS1) Marostegui: db-eqiad.php: Added comment for db1044 [mediawiki-config] - https://gerrit.wikimedia.org/r/323505 (https://phabricator.wikimedia.org/T150802)
[07:26:53] Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2822534 (Joe) @aaron why can't we just do something smarter and have a global throttling on a specific job type don...
[07:40:51] (CR) Jcrespo: [C: 1] site.pp: db1044's binlog changed to ROW [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802) (owner: Marostegui)
[07:41:48] (CR) Jcrespo: [C: 1] db-eqiad.php: Added comment for db1044 [mediawiki-config] - https://gerrit.wikimedia.org/r/323505 (https://phabricator.wikimedia.org/T150802) (owner: Marostegui)
[07:43:42] (CR) Jcrespo: site.pp: db1044's binlog changed to ROW (1 comment) [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802) (owner: Marostegui)
[07:45:07] (PS2) Marostegui: site.pp: Change db1044 binlog to ROW [puppet] - https://gerrit.wikimedia.org/r/323504 (https://phabricator.wikimedia.org/T150802)
[07:51:45] !log Stopping replication db1052 for maintenance - T151607
[07:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:57] T151607: Rebuild old timestamp format tables - https://phabricator.wikimedia.org/T151607
[08:06:00] (CR) Nikerabbit: [C: -1] Explicitly set cookieDomain for ContentTranslationSiteTemplates (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) (owner: KartikMistry)
[08:14:17] jynus: Your mother is available, for sex?
[08:17:24] _joe_: Your mother is available, for sex?
[08:18:00] <_joe_> botwanker: she has black friday offers up, yes, you should run. I'm sure that would make you feel important too
[08:18:30] @_joe_: Your mother is available, for fucking in the ass?
[08:18:54] but in a more serious topic
[08:19:07] I am fucking up all the wikimedia shit
[08:19:09] all the chans
[08:19:22] all of them, none shall be spared the wrath of me!
[08:19:48] * _joe_ shrugs
[08:20:11] hashar: Your mother is available, for sex?
[08:34:15] sigh
[08:34:41] Why people don't have nothing better to do than doing these things
[08:35:49] bored / having fun
[08:35:54] welcome to the internet!
[08:37:06] <_joe_> p858snake|L2: there is a clear reason why I didn't ban him
[08:37:20] <_joe_> it fuels his sense of self-importance
[08:37:35] <_joe_> I was kinda managing the situation
[08:38:57] _joe_: sorry, only just saw your comment here now. I disagree with your assessment, but I should have discussed it here beforehand.
[08:47:03] <_joe_> valhallasw`cloud: it's ok :)
[08:52:16] !log restarting Yarn and HDFS masters on analytics100[12] (Hadoop cluster) to complete the openjdk update
[08:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:53] !log restarting zotero on sca1003, almost out of RAM, puppet failing
[08:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:05] <_joe_> !log uploading hhvm_3.12.7+dfsg-1+wmf4 to apt
[08:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:51] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[09:18:04] Operations, MediaWiki-Cache, Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2822732 (hashar)
[09:19:40] Operations, MediaWiki-Cache, Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2822737 (hashar) I guess #traffic is sufficient. I have filled this task as potential material to look at high rate of cache purges which apparently is or will be an issue. N...
[09:23:28] <_joe_> !log upgraded hhvm on the debug hosts
[09:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:57] Operations, MediaWiki-Cache, Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2822747 (hashar) Looks like the CdnPurgeJob are intentionally NOT deduplicated! ``` name=includes/jobqueue/jobs/CdnPurgeJob.php, lang=php class CdnPurgeJob extends Job {...
[09:33:01] Operations, Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2822762 (elukey) I checked some hosts showing the same behavior as cp1055 and the type of request that causes the assert failure is always the same: ``` /w/api.php?action=quer...
[09:44:28] Operations, Upstream: Trusty: debug information found in "/usr/lib/debug//usr/lib/php5/20121212/mysql.so" does not match "/usr/lib/php5/20121212/mysql.so" (CRC mismatch). - https://phabricator.wikimedia.org/T145706#2822771 (hashar) Open>declined Step to reproduce: ``` $ gdb /usr/lib/php5/20121212...
[09:46:45] Operations: Puppet CA rollover - https://phabricator.wikimedia.org/T150823#2797837 (Volans) And also of T111654 for MariaDB's usage of this CA. Adding #dba
[09:56:10] Operations: Puppet CA rollover - https://phabricator.wikimedia.org/T150823#2797837 (Joe) There are other systems using this CA, namely etcd. I would suggest the right course of action for all apps is: # create a new CA key/cert # Install it as a trusted CA on all live systems # Audit how applications use t...
[10:10:23] Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2822790 (elukey) Open>stalled
[10:10:35] Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2288419 (elukey) This task is currently blocked by T137345
[10:13:51] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.14 seconds
[10:14:14] ^ checking
[10:14:39] there is no long running queries
[10:15:26] marostegui, db1047 is an analytics, non-core server
[10:15:56] ah fine
[10:16:04] io reads has skyrocketed
[10:16:05] bad analytics people causing trouble
[10:16:10] XD
[10:19:51] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 0.11 seconds
[10:20:14] <_joe_> !log upgrading HHVM across codfw
[10:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:29] Operations, Ops-Access-Requests, Discovery, Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2822802 (Gehel) Resolved>Open >>! In T151063#2811510, @Ottomata wrote: > Users in the `stats` gro...
[10:34:25] Hello. Can someone run a maintenance script (dry-run) on angwiki for me?
[10:40:40] mafk, I do not think running such a script it ok in deployment freeze
[10:41:04] if anything goes bad, nobody is here that will be able to help
[10:41:13] jynus: dry-run is not the title of the script
[10:41:31] it's for checking bad redirects in report-only mode, no changes made
[10:41:40] but if can't be done, then ok
[10:42:17] (PS1) Alexandros Kosiaris: Add network default value to Grafana Server Board [puppet] - https://gerrit.wikimedia.org/r/323513
[10:42:29] mafk, precisely, the right people to help are not here
[10:42:47] jynus: I understand
[10:43:16] I'll note it on Phab for next week
[10:44:26] (CR) Alexandros Kosiaris: [C: 2] Add network default value to Grafana Server Board [puppet] - https://gerrit.wikimedia.org/r/323513 (owner: Alexandros Kosiaris)
[10:50:39] (PS1) Alexandros Kosiaris: Remove some more garbage from Grafana Server Board [puppet] - https://gerrit.wikimedia.org/r/323515
[10:50:48] (CR) Alexandros Kosiaris: [C: 2 V: 2] Remove some more garbage from Grafana Server Board [puppet] - https://gerrit.wikimedia.org/r/323515 (owner: Alexandros Kosiaris)
[11:04:53] (PS1) Elukey: Avoid Redis IPsec replication if the host doesn't need it. [puppet] - https://gerrit.wikimedia.org/r/323517 (https://phabricator.wikimedia.org/T137345)
[11:05:19] !log uploaded libvmod-{netmapper,tbf,vslp} to carbon main component (T150660)
[11:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:30] T150660: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660
[11:05:51] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:15:13] Operations, Performance-Team, Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2822968 (Gilles)
[11:16:21] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[11:17:21] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3869934 keys, up 25 days 2 hours - replication_delay is 0
[11:22:17] (PS2) Elukey: Avoid Redis IPsec replication if the host doesn't need it. [puppet] - https://gerrit.wikimedia.org/r/323517 (https://phabricator.wikimedia.org/T137345)
[11:33:51] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[11:50:13] Operations, Performance-Team, Thumbor: Investigate source of thumbnail 302 redirects - https://phabricator.wikimedia.org/T148410#2822991 (Gilles) @fgiunchedi do I have access to any of the varnish production frontends? I'd like to run something like this: `varnishlog -q "RespStatus eq 302" -i ReqHea...
[11:56:18] Operations, ops-eqiad, DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2823011 (Marostegui) p:Triage>Normal
[11:57:01] Operations, ops-codfw, ops-ulsfo: cp4008 power supply failure - https://phabricator.wikimedia.org/T151275#2823013 (Volans) p:Triage>Normal
[11:57:53] Operations, DBA, Patch-For-Review: db1092 crash - https://phabricator.wikimedia.org/T151272#2823014 (Marostegui) p:Triage>Normal
[11:59:25] Operations, Documentation: Proper documentation for Yubico 2FA for production use - https://phabricator.wikimedia.org/T151050#2823015 (Volans) p:Triage>Low
[11:59:52] Operations: Run systematic availability tests - https://phabricator.wikimedia.org/T151049#2823016 (Volans) p:Triage>Normal
[12:00:07] Operations: Icinga monitoring for Yubikey components - https://phabricator.wikimedia.org/T151048#2823044 (Volans) p:Triage>Normal
[12:00:26] Operations: Integrate Yubikey into data.yaml - https://phabricator.wikimedia.org/T151047#2823045 (Volans) p:Triage>Normal
[12:00:50] Operations: Fully puppetise yubikey-val - https://phabricator.wikimedia.org/T151046#2823046 (Volans) p:Triage>Normal
[12:01:04] Operations: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045#2823047 (Volans) p:Triage>Normal
[12:02:06] Operations: Internal PKI for secure communication - Barcelona Ops offsite 2016 - https://phabricator.wikimedia.org/T150822#2823048 (Volans) p:Triage>Normal
[12:04:11] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:15:02] Operations, ops-codfw, fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2758792 (Volans) This host was left in scheduled downtime and it was scheduled for a long downtime. I've removed the downtime to the host and all services.
[12:17:25] Operations, Icinga, Labs, Labs-Infrastructure, Monitoring: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#2823158 (Volans) Resolved>Open `labtestcontrol2001` host and many of it's services are in scheduled downtime on Icinga. If this was...
[12:23:17] Operations, Icinga, Labs, Labs-Infrastructure, Monitoring: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#2823179 (Volans) Similar situation of scheduled downtime for the host and some of the services also for: - `labtestnet2001`: all ok - `labte...
[12:32:11] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[12:44:00] Operations: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#2823217 (Volans)
[12:54:47] (PS1) Jcrespo: [WIP] Create script to check that sanitarium filtering is working [puppet] - https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802)
[13:02:27] (CR) jenkins-bot: [V: -1] [WIP] Create script to check that sanitarium filtering is working [puppet] - https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: Jcrespo)
[13:27:00] Operations, Traffic: Varnishkafka and related VSM daemons seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2823275 (elukey) p:Triage>Normal
[13:31:38] Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2610832 (Reedy) Ok, so what username are you using on wikitech? Is it Shoichi too? I notice that isn't linked to your Wikitech account, but is to your normal wiki account...
[13:40:27] (PS1) Elukey: Remove cron notifications to root@ for jobchron/runner service status [puppet] - https://gerrit.wikimedia.org/r/323528 (https://phabricator.wikimedia.org/T132324)
[13:54:54] Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823313 (Shoichi) >>! In T144805#2823276, @Reedy wrote: > Ok, so what username are you using on wikitech? Is it Shoichi too? I notice that isn't linked to your Wikitech account, but is to your n...
[14:16:18] (PS1) Ema: varnishapi.py: add VSM_Close C binding [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561)
[14:16:20] !log delete oathauth row on wikitech for user Shoichi per T144805
[14:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:31] T144805: Can't login wikitech - https://phabricator.wikimedia.org/T144805
[14:19:17] Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823372 (Reedy) This has now been done :)
[14:20:11] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:20:37] Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823375 (Liuxinyu970226) >>! In T144805#2823372, @Reedy wrote: > This has now been done :) me too...
[14:22:17] !log delete oathauth row on wikitech for user Liuxinyu970226 per T144805
[14:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:29] T144805: Can't login wikitech - https://phabricator.wikimedia.org/T144805
[14:28:59] Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823403 (Reedy) Open>Resolved a:Reedy
[14:48:45] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/323528 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
[14:48:49] (PS3) Ema: varnish: rename scripts depending on varnishlog.py [puppet] - https://gerrit.wikimedia.org/r/323423 (https://phabricator.wikimedia.org/T150660)
[14:48:56] Operations, netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2823419 (faidon) This has been opened with JTAC as case [[ https://casemanager.juniper.net/casemanager/#/cmdetails/2016-1125-0413 | 2016-1125-0413 ]].
[14:48:59] (CR) Ema: [C: 2 V: 2] varnish: rename scripts depending on varnishlog.py [puppet] - https://gerrit.wikimedia.org/r/323423 (https://phabricator.wikimedia.org/T150660) (owner: Ema)
[14:49:11] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[14:55:30] (CR) Faidon Liambotis: [C: 2] "LGTM but please test it after deploying it, too critical to fail." [puppet] - https://gerrit.wikimedia.org/r/322362 (owner: Volans)
[14:58:07] (CR) Faidon Liambotis: [C: 1] add additional information on malformed responses [software/service-checker] - https://gerrit.wikimedia.org/r/321714 (https://phabricator.wikimedia.org/T150560) (owner: Volans)
[14:58:36] Amir1: ping
[15:03:04] (CR) Faidon Liambotis: [C: 1] "This is nice work!" [puppet] - https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043) (owner: Volans)
[15:09:42] Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2823445 (matmarex) @joe's runJobs.php run has brought the queue back to normal levels, but it is growing again at a...
[15:11:06] Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2823448 (Joe) @matmarex agreed, I was waiting for about one day of data to accrue, but it's clearly a good idea to...
[15:18:19] (CR) Faidon Liambotis: [C: 2] Remove cron notifications to root@ for jobchron/runner service status [puppet] - https://gerrit.wikimedia.org/r/323528 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
[15:18:23] (PS2) Faidon Liambotis: Remove cron notifications to root@ for jobchron/runner service status [puppet] - https://gerrit.wikimedia.org/r/323528 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
[15:21:01] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:21:44] (CR) Elukey: [C: 1] varnishapi.py: add VSM_Close C binding [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561) (owner: Ema)
[15:23:02] (PS2) Ema: varnishapi.py: add VSM_Close C binding [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561)
[15:23:05] (CR) Elukey: "Maybe we could wait for an official release from upstream to bump the varniapi.py version number" [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561) (owner: Ema)
[15:23:11] (CR) Ema: [C: 2 V: 2] varnishapi.py: add VSM_Close C binding [puppet] - https://gerrit.wikimedia.org/r/323530 (https://phabricator.wikimedia.org/T151561) (owner: Ema)
[15:23:12] FlorianSW: I'm just back
[15:23:14] what's up
[15:24:02] Amir1: hi :) I already wrote my question in phabricator: It seems you're not registered in the Google code-in platform, right?
[15:24:25] Yes, Can I talk to you privately?
[15:25:45] sure :)
[15:33:49] FlorianSW, Amir1: Just FYI I'm also around (and still need to reply to Amir1's email, sigh...)
[15:33:56] (and probably wrong channel here)
[15:35:17] o/
[15:43:11] (PS1) Hashar: Shinken wmflabs: remove Chris McMahon [puppet] - https://gerrit.wikimedia.org/r/323536
[15:43:23] Operations, Labs, wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2823470 (zhuyifei1999) Um.. @shizhao?
[15:44:58] (PS1) Hashar: Remove Antoine from beta cluster notifications [puppet] - https://gerrit.wikimedia.org/r/323537
[15:49:01] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[15:51:05] (CR) Aude: [C: 1] Use entity types for the repoNamespaces Wikibase client setting [mediawiki-config] - https://gerrit.wikimedia.org/r/323347 (owner: Hoo man)
[15:52:22] Operations, Traffic: Varnishkafka seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2823500 (elukey)
[15:58:22] Operations, Traffic: varnishlog daemons seeing Log overrun constantly - https://phabricator.wikimedia.org/T151643#2823504 (elukey)
[16:08:22] (CR) Alex Monk: "I think you forgot to delete the old file" [puppet] - https://gerrit.wikimedia.org/r/319384 (owner: Rush)
[16:09:21] Operations, Traffic: varnishlog daemons seeing Log overrun constantly - https://phabricator.wikimedia.org/T151643#2823522 (elukey) p:Triage>High
[16:19:33] Operations, Traffic: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2823540 (elukey)
[16:35:23] (PS1) Chad: gerrit (2.13.3-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/323545 (https://phabricator.wikimedia.org/T146350)
[16:47:41] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:48:41] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[16:53:41] (PS4) Volans: RAID: get RAID status improvement for MegaCLI [puppet] - https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043)
[17:15:00] !log drop database vewikimedia (deleted wiki) from sanitarium and its slaves
[17:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:04] (CR) Paladox: [C: 1] gerrit (2.13.3-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/323545 (https://phabricator.wikimedia.org/T146350) (owner: Chad)
[17:23:48] (PS1) Mobrovac: RESTBase: Add the PDF Render service config [puppet] - https://gerrit.wikimedia.org/r/323548
[17:28:44] (PS2) Mobrovac: RESTBase: Add the PDF Render service config [puppet] - https://gerrit.wikimedia.org/r/323548
[17:32:21] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:32:31] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:33:43] is tools-labs servers operating okay it appars one of them may be needing a restart or something.
[17:34:34] Zppix, which one?
[17:34:54] Krenair the one whom earwig's copyvio tool is running on
[17:35:07] got a link?
[17:35:12] to ?
[17:35:18] the tool
[17:35:22] https://tools.wmflabs.org/copyvios
[17:36:00] 208.80.155.131:80 Krenair if that helps
[17:36:17] it doesn't really
[17:36:30] so what's wrong with the tool?
[17:36:32] ok... i wish i could tell you the exact server(s)
[17:37:11] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:37:17] I know which server it's running on
[17:37:21] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[17:38:24] (PS2) Jcrespo: [WIP] Create script to check that sanitarium filtering is working [puppet] - https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802)
[17:38:26] but I don't see anything wrong with the tool Zppix
[17:39:05] (CR) jenkins-bot: [V: -1] [WIP] Create script to check that sanitarium filtering is working [puppet] - https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: Jcrespo)
[17:39:05] Krenair it wasnt loading at its normal rate earlier
[17:39:22] im just alerting someone so it doesnt come back to bite someone in the ass later (pardon my french)
[17:40:08] well from what I can tell the server it's running on is fine
[17:40:34] Krenair okay. I just wanted to let someone know incase it was a server issue.
[17:40:43] okay
[17:40:56] If it does it again I will let you know
[17:41:46] the proxy in front of tools looks okay too
[17:41:56] ok
[17:45:31] Krenair for future reference what is the server name, so I dont need to have someone go digging.
[17:45:51] it's on the tools grid, so it can change
[17:46:08] ah, ok
[17:46:21] surpised its not kubectl
[17:52:53] I'm not really
[17:53:16] I don't know if we have stats on the use of each system but the grid is still very much in active use
[17:54:57] everything on my tool is on kubectl (my webservice and my bot)
[18:02:55] Operations, Gerrit, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2823628 (Paladox) I found this https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
[18:07:22] Operations, media-storage: Consider storage policies for swift - https://phabricator.wikimedia.org/T151648#2823630 (fgiunchedi)
[18:13:44] (CR) Ppchelko: [C: 1] RESTBase: Add the PDF Render service config [puppet] - https://gerrit.wikimedia.org/r/323548 (owner: Mobrovac)
[18:16:11] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:43:11] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[18:51:01] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:18:01] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[19:26:02] (CR) Aklapper: "This can be ABANDONED as T151148 has been resolved." [puppet] - https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148) (owner: 20after4)
[19:43:31] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:44:42] Operations, Traffic, Patch-For-Review, Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2823701 (fgiunchedi) The error should have been fixed upstream by https://github.com/jonnenauha/prometheus_varnish_ex...
[20:09:34] !log mwscript deleteEqualMessages.php --wiki angwiki (T45917)
[20:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:45] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[20:12:31] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[20:20:12] (PS1) Urbanecm: [throttle] Exception for #MOWomenOnWikipedia Edit-A-Thon [mediawiki-config] - https://gerrit.wikimedia.org/r/323555 (https://phabricator.wikimedia.org/T151650)
[20:40:29] (PS1) Aude: Move interwiki sorting orders to config [mediawiki-config] - https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023)
[20:43:34] (PS2) Urbanecm: [throttle] Exception for #MOWomenOnWikipedia Edit-A-Thon [mediawiki-config] - https://gerrit.wikimedia.org/r/323555 (https://phabricator.wikimedia.org/T151650)
[20:45:33] (PS1) Urbanecm: [throttle] Remove old throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/323557
[21:23:51] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:25:28] (PS1) Filippo Giunchedi: prometheus: add vhtcpd stats via node-exporter [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429)
[21:52:51] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[22:00:31] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[22:01:21] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3915068 keys, up 25 days 13 hours - replication_delay is 0
[22:09:11] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:13:07] (PS10) Paladox: Add gbp.conf file for debian [debs/gerrit] - https://gerrit.wikimedia.org/r/301841
[22:14:16] (PS11) Chad: Add gbp.conf file for debian [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 (owner: Paladox)
[22:14:52] (CR) Paladox: "recheck" [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 (owner: Paladox)
[22:15:01] (CR) Chad: [C: 2] Add gbp.conf file for debian [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 (owner: Paladox)
[22:15:05] (CR) Chad: [V: 2] Add gbp.conf file for debian [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 (owner: Paladox)
[22:16:29] (PS2) Chad: gerrit (2.13.3-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/323545 (https://phabricator.wikimedia.org/T146350)
[22:38:11] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[22:59:01] (CR) Volans: "A couple of minor inline comments" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429) (owner: Filippo Giunchedi)
[23:01:21] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:10:51] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:28:33] (PS2) Filippo Giunchedi: prometheus: add vhtcpd stats via node-exporter [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429)
[23:29:21] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[23:31:42] (Abandoned) 20after4: Allow aklapper to `sudo -E` phabricator admin utilities [puppet] - https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148) (owner: 20after4)
[23:38:51] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures