[00:25:17] <urandom>	 !log T169939: Decommissioning restbase1010-a.eqiad.wmnet
[00:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:32] <stashbot>	 T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[00:25:49] <wikibugs_>	 (PS1) Reedy: $wgScoreSafeMode = false [mediawiki-config] - https://gerrit.wikimedia.org/r/374445 (https://phabricator.wikimedia.org/T174413)
[00:26:47] <wikibugs_>	 (PS1) Ebe123: Set $wgScoreSafeMode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413)
[00:30:33] <wikibugs_>	 (CR) MZMcBride: "Dupe of <https://gerrit.wikimedia.org/r/374445>?" [mediawiki-config] - https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: Ebe123)
[00:37:37] <wikibugs_>	 (CR) Ebe123: "> Dupe of <https://gerrit.wikimedia.org/r/374445>?" [mediawiki-config] - https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: Ebe123)
[00:40:44] <wikibugs_>	 (CR) MZMcBride: "True. :-)  And you filed the Phabricator task. I think Reedy should abandon his changeset." [mediawiki-config] - https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: Ebe123)
[00:49:52] <wikibugs_>	 (Abandoned) Reedy: $wgScoreSafeMode = false [mediawiki-config] - https://gerrit.wikimedia.org/r/374445 (https://phabricator.wikimedia.org/T174413) (owner: Reedy)
[01:04:23] <icinga-wm>	 PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[01:05:24] <icinga-wm>	 RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9813380 keys, up 5 minutes 17 seconds - replication_delay is 0
[01:36:44] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 40
[02:07:53] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.114:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[02:10:33] <wikibugs_>	 Operations, ops-eqiad, Services (doing): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3560918 (Eevans) We needed to decommission a node in rack 'a' as part of {T169939}, that was going to be 1007 (for consistency sake), but restbase1010 has been decommissioned ins...
[02:13:54] <wikibugs_>	 Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3560922 (Eevans)
[02:15:03] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a SSL 10.192.48.46:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned (T169939)
[02:16:03] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.114:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned (T169939)
[02:27:26] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.15) (duration: 08m 00s)
[02:27:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:10] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Aug 29 02:34:09 UTC 2017 (duration 6m 44s)
[02:34:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:13] <icinga-wm>	 PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:25:19] <wikibugs_>	 Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3560941 (Papaul) @madhuvishy  I took a quick look at labstore2001 the H800 controller doesn't allow me to create a RAID...
[03:28:43] <icinga-wm>	 RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms
[03:31:54] <icinga-wm>	 PROBLEM - Host labstore2002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:32:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.09 seconds
[03:53:53] <icinga-wm>	 RECOVERY - Host labstore2002 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms
[04:41:24] <wikibugs_>	 (CR) Phedenskog: "Anything more that needs to be done on this? I'm waiting for this to be pushed before I can push my changes :)" [puppet] - https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[04:44:27] <wikibugs_>	 Operations, Dumps-Generation, Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3561021 (ArielGlenn) A few more thoughts.  I should stop thinking of this as an rsync and instead think of it as a copy of files that don't exist/need updat...
[04:55:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 264.49 seconds
[05:27:35] <wikibugs_>	 (CR) Reception123: [C: 1] Gerrit: Set auth.userNameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/368196 (owner: Paladox)
[05:33:55] <wikibugs_>	 Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3561111 (madhuvishy) @Papaul, Hardware RAID 10 on both labstore2001 and 2002, with 6 or 8 disks per logical/virtual RAID...
[05:49:00] <logmsgbot>	 !log demon@tin Pruned MediaWiki: 1.30.0-wmf.11 (duration: 02m 50s)
[05:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:44] <wikibugs_>	 (PS1) Chad: Fixing indentation warning [mediawiki-config] - https://gerrit.wikimedia.org/r/374453
[05:58:46] <wikibugs_>	 (CR) Chad: [C: 2] Fixing indentation warning [mediawiki-config] - https://gerrit.wikimedia.org/r/374453 (owner: Chad)
[06:00:11] <wikibugs_>	 (PS2) Marostegui: Add electcomwiki to private_wikis [puppet] - https://gerrit.wikimedia.org/r/374384 (https://phabricator.wikimedia.org/T174370) (owner: Reedy)
[06:00:13] <wikibugs_>	 (Merged) jenkins-bot: Fixing indentation warning [mediawiki-config] - https://gerrit.wikimedia.org/r/374453 (owner: Chad)
[06:00:27] <wikibugs_>	 (CR) jenkins-bot: Fixing indentation warning [mediawiki-config] - https://gerrit.wikimedia.org/r/374453 (owner: Chad)
[06:01:12] <wikibugs_>	 (CR) Marostegui: [C: 2] Add electcomwiki to private_wikis [puppet] - https://gerrit.wikimedia.org/r/374384 (https://phabricator.wikimedia.org/T174370) (owner: Reedy)
[06:05:01] <marostegui>	 !log Restart MariaDB on db1102 and db1095 to pick up new replication filters - T174385
[06:05:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:16] <stashbot>	 T174385: Prepare and check storage layer for electcomwiki - https://phabricator.wikimedia.org/T174385
[06:09:34] <logmsgbot>	 !log demon@tin Synchronized scap/plugins/clean.py: no-op, for consistency (duration: 00m 43s)
[06:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:59] <wikibugs_>	 Operations, DBA, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3561141 (Marostegui)
[06:29:12] <wikibugs_>	 Operations, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3561142 (Marostegui) I have closed the ticket that relates to the DBAs (add the replication filters and restart MariaDB on the sanitarium hosts).  Going to remove the...
[06:44:03] <icinga-wm>	 PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 57491 MB (3% inode=97%)
[06:56:28] <moritzm>	 !log installing ghostscript security updates on trusty (Debian already fixed)
[06:56:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:35] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/374476 (https://phabricator.wikimedia.org/T168661)
[07:06:20] <wikibugs_>	 Operations, Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3561169 (ema) Open>Resolved Looks good, thanks @Cmjohnson!
[07:13:18] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/374476 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:14:50] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/374476 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:16:02] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 - T168661 (duration: 00m 42s)
[07:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:16] <stashbot>	 T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[07:17:38] <wikibugs_>	 (CR) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/374476 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:19:58] <wikibugs_>	 (PS1) Marostegui: mariadb: Update db1091 socket location [puppet] - https://gerrit.wikimedia.org/r/374487 (https://phabricator.wikimedia.org/T148507)
[07:32:45] <icinga-wm>	 RECOVERY - Check systemd state on mw1259 is OK: OK - running: The system is fully operational
[07:33:20] <elukey>	 systemctl reset-failed puppet.service --^
[07:41:18] <marostegui>	 !log Upgrade MariaDB on db1091 to 10.0.32 - T168661
[07:41:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:31] <stashbot>	 T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[07:45:28] <wikibugs_>	 Operations, monitoring, Graphite, User-fgiunchedi: Audit groups of metrics in Graphite that allocate a lot of disk space - https://phabricator.wikimedia.org/T1075#3561227 (fgiunchedi)
[07:45:51] <wikibugs_>	 Operations, Analytics: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561228 (fgiunchedi)
[07:46:03] <wikibugs_>	 Operations, Analytics, monitoring: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3106537 (fgiunchedi)
[07:46:06] <godog>	 elukey: reopened ^
[07:47:07] <elukey>	 godog: ack!
[07:49:24] <wikibugs_>	 (CR) Muehlenhoff: [C: -1] "ffmpeg2theora is current still needed in the TMH extension to differentiate between Ogg Vorbis audio files and Ogg Theora video files, nee" [puppet] - https://gerrit.wikimedia.org/r/373733 (https://phabricator.wikimedia.org/T172445) (owner: Muehlenhoff)
[07:49:24] <elukey>	 godog: in the meantime I'll try to delete mtime +7
[07:49:59] <godog>	 elukey: ok thanks!
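At 07:49 elukey proposes to "delete mtime +7" on graphite1001 to clear the /var/lib/carbon disk alert, i.e. to remove files not modified for more than seven days, as `find <dir> -mtime +7 -delete` would. The Python sketch below reproduces that selection rule under stated assumptions: the root directory, the `.wsp` (Graphite whisper) suffix and the dry-run default are illustrative choices, not what was actually run on the host. find truncates a file's age to whole days before comparing, so `+7` only matches files at least eight full days old.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def stale_files(root, days=7, suffix=".wsp", now=None):
    """Yield files under root whose age in whole days exceeds `days`.

    Mirrors find(1)'s `-mtime +N` test: the age is truncated to whole
    24-hour periods before comparing, so +7 matches files >= 8 days old.
    The `.wsp` suffix is an assumption (Graphite whisper files).
    """
    now = time.time() if now is None else now
    for path in Path(root).rglob("*" + suffix):
        if not path.is_file():
            continue
        age_days = int((now - path.stat().st_mtime) // SECONDS_PER_DAY)
        if age_days > days:
            yield path


def delete_stale(root, days=7, suffix=".wsp", dry_run=True, now=None):
    """Return the stale files; unlink them only when dry_run is False."""
    victims = sorted(stale_files(root, days, suffix, now))
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

Defaulting to a dry run lets the candidate list be reviewed before anything is removed, which matters on a metrics host where deleted whisper files mean lost history.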
[07:51:14] <icinga-wm>	 RECOVERY - Disk space on graphite1001 is OK: DISK OK
[07:55:05] <wikibugs_>	 (CR) Marostegui: [C: 2] mariadb: Update db1091 socket location [puppet] - https://gerrit.wikimedia.org/r/374487 (https://phabricator.wikimedia.org/T148507) (owner: Marostegui)
[07:56:09] <wikibugs_>	 (PS3) Giuseppe Lavagetto: git::clone: enhance compatibility with the future parser [puppet] - https://gerrit.wikimedia.org/r/374321 (https://phabricator.wikimedia.org/T171704)
[07:57:36] <wikibugs_>	 (PS1) Elukey: role::graphite::production: lower down eventstreams rdkafka retention [puppet] - https://gerrit.wikimedia.org/r/374500 (https://phabricator.wikimedia.org/T160644)
[07:57:45] <elukey>	 godog: --^
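The retention change in gerrit 374500 helps because Graphite's whisper backend preallocates a fixed-size file for each metric, sized by its retention schedule, so many per-client librdkafka metrics cost disk in proportion to how many points each one keeps. The schema only applies when a whisper file is created; existing files keep their size unless they are resized (for example with whisper-resize.py) or deleted. A rough Python estimate follows, assuming the standard whisper layout (a 16-byte header, 12 bytes per archive descriptor, 12 bytes per datapoint) and a simplified parser for `precision:duration` retention strings; the retentions used as examples are hypothetical, not the values in the change.

```python
import struct

# Sizes of the on-disk records in Graphite's whisper format (network byte order).
HEADER = struct.calcsize("!2LfL")      # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO = struct.calcsize("!3L")  # offset, seconds per point, number of points
POINT = struct.calcsize("!Ld")         # timestamp (uint32) + value (double)

# Simplified unit table (assumption); whisper's own parser accepts more spellings.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 7 * 86400, "y": 365 * 86400}


def to_seconds(spec: str) -> int:
    """Convert '1m', '7d', or a bare integer string to seconds."""
    if spec[-1] in UNITS:
        return int(spec[:-1]) * UNITS[spec[-1]]
    return int(spec)


def whisper_size(retentions: str) -> int:
    """Estimate the file size in bytes for a retention string like '1m:7d,15m:1y'."""
    archives = []
    for part in retentions.split(","):
        precision, duration = (to_seconds(x) for x in part.strip().split(":"))
        archives.append((precision, duration // precision))
    points = sum(n for _, n in archives)
    return HEADER + ARCHIVE_INFO * len(archives) + POINT * points
```

For example, `whisper_size("1m:7d")` comes to 120,988 bytes per metric versus 518,428 for `"1m:30d"`; multiplied across many per-client metrics, shortening retention is what shrinks the footprint.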
[07:58:05] <wikibugs_>	 (CR) Giuseppe Lavagetto: [C: 2] git::clone: enhance compatibility with the future parser [puppet] - https://gerrit.wikimedia.org/r/374321 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[07:58:13] <wikibugs_>	 (CR) Elukey: [C: 2] role::graphite::production: lower down eventstreams rdkafka retention [puppet] - https://gerrit.wikimedia.org/r/374500 (https://phabricator.wikimedia.org/T160644) (owner: Elukey)
[07:58:18] <wikibugs_>	 (PS2) Elukey: role::graphite::production: lower down eventstreams rdkafka retention [puppet] - https://gerrit.wikimedia.org/r/374500 (https://phabricator.wikimedia.org/T160644)
[07:58:21] <wikibugs_>	 (CR) Elukey: [V: 2 C: 2] role::graphite::production: lower down eventstreams rdkafka retention [puppet] - https://gerrit.wikimedia.org/r/374500 (https://phabricator.wikimedia.org/T160644) (owner: Elukey)
[07:58:29] <_joe_>	 elukey: merge my change when you're done
[07:58:53] <elukey>	 _joe_ ack
[07:59:00] <_joe_>	 and btw now verification takes below 20 seconds, don't V+2 yourself
[07:59:54] <elukey>	 _joe_ I got +2 literally two seconds before that
[08:01:04] <wikibugs_>	 (PS2) Giuseppe Lavagetto: role::mariadb::misc: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374349 (https://phabricator.wikimedia.org/T171704)
[08:01:52] <wikibugs_>	 Operations, Analytics, monitoring, Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561239 (elukey) The other step to take would be to limit the amount of data that we store for librkafka, because with so many clients it is impossible to keep track o...
[08:02:42] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661)
[08:03:15] <wikibugs_>	 (CR) Alexandros Kosiaris: [C: 1] Icinga: Add basic monitoring for routers' active RE [puppet] - https://gerrit.wikimedia.org/r/374435 (https://phabricator.wikimedia.org/T174397) (owner: Ayounsi)
[08:03:31] <wikibugs_>	 (PS2) Marostegui: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661)
[08:04:17] <wikibugs_>	 Operations, ops-codfw, Patch-For-Review, User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3561243 (elukey)
[08:04:20] <wikibugs_>	 Operations, ops-codfw, Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3561241 (elukey) Open>Resolved Closing this task since the hw issue should have been resolved. Will re-open if necessary. Thanks @Papaul  for the work done!
[08:06:37] <elukey>	 !log drop log.MobileWebUIClickTracking_10742159_15423246 from dbstore1002 to free space (table archived on HDFS) - T172322 T168303
[08:06:43] <elukey>	 marostegui: --^
[08:06:48] <marostegui>	 <3
[08:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:52] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:06:52] <stashbot>	 T172322: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322
[08:06:52] <stashbot>	 T168303: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303
[08:06:53] <wikibugs_>	 Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3561251 (akosiaris) stalled>Open a:fgiunchedi Since the upgrade is done, I am reverting actions taken in T171167#3536747 and T171167#3...
[08:06:58] <marostegui>	 is it a big one?
[08:08:16] <elukey>	 marostegui: 500GB on paper, but probably a lot less on disk
[08:08:19] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:08:28] <wikibugs_>	 (CR) jenkins-bot: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:09:19] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1091 with low weight - T168661 (duration: 00m 43s)
[08:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:31] <stashbot>	 T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[08:13:17] <wikibugs_>	 (CR) Alexandros Kosiaris: [C: 1] "With notification_options   => 'c,r,f' for monitoring::service, yes this will work as advertised." [puppet] - https://gerrit.wikimedia.org/r/374368 (https://phabricator.wikimedia.org/T172131) (owner: Herron)
[08:13:29] <wikibugs_>	 (PS2) ArielGlenn: add user and directory setup to dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/374242 (https://phabricator.wikimedia.org/T169849)
[08:15:20] <wikibugs_>	 Operations, monitoring, netops, User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3561258 (fgiunchedi) For power usage my first attempt is something like this to calculate watts for 3 phase PDU: `sum(current) * avg(voltage) * sqrt(3)`  Or a...
[08:20:00] <moritzm>	 !log reimaging mw1169 (video scaler) to jessie 
[08:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:37] <wikibugs_>	 Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3561262 (fgiunchedi) Open>Resolved Thanks @akosiaris @ayounsi ! No more invalid metrics in graphite logs AFAICS, resolving!
[08:32:03] <hashar>	 !log apt-get upgrade on contint1001 and contint2001
[08:32:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:42] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Increase db1091 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374503
[08:34:57] <wikibugs_>	 (CR) Filippo Giunchedi: "LGTM! Please add a new entry to debian/changelog with a new version number (1.2-3). The "dch" tool in "devscripts" package can help you do" [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/373595 (https://phabricator.wikimedia.org/T161719) (owner: Matthias Mullie)
[08:35:55] <gehel>	 !log restart wdqs-updater on wdqs2001
[08:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:11] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Increase db1091 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374503 (owner: Marostegui)
[08:39:22] <wikibugs_>	 (CR) Filippo Giunchedi: "> So this is ~50G for the data raid-1?  If so, that seems to be about" [puppet] - https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: Filippo Giunchedi)
[08:40:38] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Increase db1091 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374503 (owner: Marostegui)
[08:40:48] <wikibugs_>	 (CR) jenkins-bot: db-eqiad.php: Increase db1091 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374503 (owner: Marostegui)
[08:41:47] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1091 weight - T168661 (duration: 00m 43s)
[08:41:49] <moritzm>	 !log restarting archiva to pick up openjdk security update
[08:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:02] <stashbot>	 T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[08:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:13] <akosiaris>	 !log upload kubernetes_1.7.4-1 to apt.wikimedia.org/stretch-wikimedia/main T170119
[08:42:19] <elukey>	 !log restart yarn/hdfs daemons for openjdk security updates
[08:42:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:25] <stashbot>	 T170119: Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119
[08:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:07] <hashar>	 !log Restarting Jenkins for openjdk update
[08:43:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:26] <wikibugs_>	 Operations, media-storage: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3561301 (Nick) Also needs File:Literature II tom, Harutyun Surkhatian.djvu deleted, created by same user with same deleti...
[08:44:56] <wikibugs_>	 (PS2) Filippo Giunchedi: install_server: add partman for cassandra JBOD [puppet] - https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939)
[08:45:40] <akosiaris>	 !log reprepro copy calico, calico-cni from jessie-wikimedia to stretch-wikimedia (apt.wikimedia.org) T170119 
[08:45:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:52] <wikibugs_>	 Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3561315 (Gehel) Logs are now sent to logstash, but the "host" field isn't set correctly (its value is always "%{HOSTNAME}". Some analysis:  * logs...
[08:53:20] <ema>	 !log upgrading cache_text to varnish 4.1.8-1wm1
[08:53:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:55] <wikibugs_>	 (CR) Giuseppe Lavagetto: [C: 2] role::mariadb::misc: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374349 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[09:07:05] <wikibugs_>	 (PS2) Giuseppe Lavagetto: phabricator::logmail: fix scoping of templates [puppet] - https://gerrit.wikimedia.org/r/374350 (https://phabricator.wikimedia.org/T171704)
[09:07:08] <wikibugs_>	 (PS3) ArielGlenn: add user and directory setup to dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/374242 (https://phabricator.wikimedia.org/T169849)
[09:14:48] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374509
[09:17:06] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1091 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374509 (owner: Marostegui)
[09:18:31] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374509 (owner: Marostegui)
[09:18:41] <wikibugs_>	 (CR) jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374509 (owner: Marostegui)
[09:19:28] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1091 original weight - T168661 (duration: 00m 43s)
[09:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:42] <stashbot>	 T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[09:24:31] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Depool db1064 for MariaDB upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/374510 (https://phabricator.wikimedia.org/T168661)
[09:27:11] <wikibugs_>	 (PS1) Alexandros Kosiaris: Reimage kubernetes1004, chlorine as stretch [puppet] - https://gerrit.wikimedia.org/r/374511 (https://phabricator.wikimedia.org/T170119)
[09:27:58] <icinga-wm>	 RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[09:28:01] <wikibugs_>	 (PS3) Giuseppe Lavagetto: phabricator::logmail: fix scoping of templates [puppet] - https://gerrit.wikimedia.org/r/374350 (https://phabricator.wikimedia.org/T171704)
[09:28:59] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1169 is CRITICAL: Host mw1169 is not in mediawiki-installation dsh group
[09:29:13] <elukey>	 !log re-installed pmacct/librdkafka1/kafkacat on rhenium with stretch versions - T173489
[09:29:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:28] <stashbot>	 T173489: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489
[09:29:41] <elukey>	 paravoid: --^
[09:29:50] <elukey>	 so puppet is re-enabled
[09:33:38] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1169.eqiad.wmnet
[09:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:15] <wikibugs_>	 (Abandoned) Elukey: profile::pmacct: pin librdkafka to stretch version [puppet] - https://gerrit.wikimedia.org/r/374360 (https://phabricator.wikimedia.org/T173489) (owner: Elukey)
[09:36:11] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1064 for MariaDB upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/374510 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[09:37:38] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1064 for MariaDB upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/374510 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[09:37:47] <wikibugs_>	 (CR) jenkins-bot: db-eqiad.php: Depool db1064 for MariaDB upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/374510 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[09:38:43] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1064 for a mariadb upgrade - T168661 (duration: 00m 43s)
[09:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:57] <stashbot>	 T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[09:39:41] <marostegui>	 !log Update MariaDB on db1064 to 10.0.32 - T168661
[09:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:15] <wikibugs_>	 (PS1) Gehel: wdqs - change logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710)
[09:40:38] <wikibugs_>	 (CR) jerkins-bot: [V: -1] wdqs - change logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710) (owner: Gehel)
[09:41:43] <wikibugs_>	 (PS2) Gehel: wdqs - change logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710)
[09:42:10] <wikibugs_>	 (CR) jerkins-bot: [V: -1] wdqs - change logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710) (owner: Gehel)
[09:43:00] <wikibugs_>	 (PS3) Gehel: wdqs - logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710)
[09:45:03] <wikibugs_>	 (CR) Giuseppe Lavagetto: [C: 2] phabricator::logmail: fix scoping of templates [puppet] - https://gerrit.wikimedia.org/r/374350 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[09:45:31] <wikibugs_>	 (PS1) Marostegui: mariadb: Update socket location for db1064 [puppet] - https://gerrit.wikimedia.org/r/374514 (https://phabricator.wikimedia.org/T148507)
[09:45:51] <wikibugs_>	 (PS2) Marostegui: mariadb: Update socket location for db1064 [puppet] - https://gerrit.wikimedia.org/r/374514 (https://phabricator.wikimedia.org/T148507)
[09:46:39] <wikibugs_>	 (CR) Marostegui: [C: 2] mariadb: Update socket location for db1064 [puppet] - https://gerrit.wikimedia.org/r/374514 (https://phabricator.wikimedia.org/T148507) (owner: Marostegui)
[09:47:15] <marostegui>	 _joe_: is it ok to merge your changes?
[09:47:29] <_joe_>	 marostegui: yeah sorry
[09:47:36] <marostegui>	 np, will do it now
[09:47:37] <_joe_>	 I was verifying the next ones
[09:50:36] <wikibugs_>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1064 for MariaDB upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/374515
[09:51:56] <wikibugs_>	 (PS2) Giuseppe Lavagetto: role::mariadb::misc::phabricator: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374351 (https://phabricator.wikimedia.org/T171704)
[09:54:25] <wikibugs_>	 (CR) Giuseppe Lavagetto: [C: 2] role::mariadb::misc::phabricator: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374351 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[09:57:01] <wikibugs_>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1064 for MariaDB upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/374515 (owner: Marostegui)
[09:57:25] <wikibugs_>	 (PS2) Giuseppe Lavagetto: requesttracker::config: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374352 (https://phabricator.wikimedia.org/T171704)
[09:58:22] <wikibugs_>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1064 for MariaDB upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/374515 (owner: Marostegui)
[09:58:32] <wikibugs_>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1064 for MariaDB upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/374515 (owner: Marostegui)
[09:59:26] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1064 - T168661 (duration: 00m 43s)
[09:59:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:39] <stashbot>	 T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[10:00:37] <icinga-wm>	 PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:37] <_joe_>	 uhm
[10:02:43] <_joe_>	 that host has troubles
[10:03:23] <wikibugs_>	 (PS3) Jcrespo: mariadb: Decommission db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/373869 (https://phabricator.wikimedia.org/T174076)
[10:03:28] <elukey>	 nooooooo
[10:03:36] * elukey cries
[10:03:43] <elukey>	 we just replaced the mainboard...
[10:03:50] <elukey>	 and again it breaks..
[10:05:26] <volans>	 so it was not that :D
[10:05:50] <wikibugs_>	 Operations, ops-codfw, Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3561443 (elukey) Resolved>Open
[10:05:52] <wikibugs_>	 Operations, ops-codfw, Patch-For-Review, User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3561444 (elukey)
[10:06:41] <wikibugs_>	 (CR) Jcrespo: [C: 2] mariadb: Decommission db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/373869 (https://phabricator.wikimedia.org/T174076) (owner: Jcrespo)
[10:07:04] <wikibugs_>	 (CR) Giuseppe Lavagetto: [C: 2] requesttracker::config: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374352 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[10:08:11] <wikibugs_>	 (Merged) jenkins-bot: mariadb: Decommission db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/373869 (https://phabricator.wikimedia.org/T174076) (owner: Jcrespo)
[10:08:20] <wikibugs_>	 (CR) jenkins-bot: mariadb: Decommission db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/373869 (https://phabricator.wikimedia.org/T174076) (owner: Jcrespo)
[10:08:27] <wikibugs_>	 Operations, ops-codfw, Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3561451 (elukey) Host frozen again, not responding to ssh and pings, com2 shows `[82623.895993] g`
[10:10:07] <wikibugs_>	 (PS2) Giuseppe Lavagetto: ganglia::gmetad::rrdcached: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374353 (https://phabricator.wikimedia.org/T171704)
[10:14:29] <jynus>	 we should depool mw2256 from deployment
[10:15:09] <icinga-wm>	 RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms
[10:15:10] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: decom db1028 (duration: 02m 48s)
[10:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:27] <elukey>	 jynus: yep it was until yesterday, I thought we solved the issue, going to depool again
[10:16:25] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: decom db1028 (duration: 00m 42s)
[10:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:20] <elukey>	 moritzm: something strange is that mw2256 is running with 4.9.30-2+deb9u2~bpo8+1
[10:18:33] <elukey>	 just noticed it
[10:18:39] <elukey>	 the others of its batch have 4.9.25-1~bpo8+3
[10:19:04] <wikibugs_>	 (CR) Filippo Giunchedi: "PCC says yes https://puppet-compiler.wmflabs.org/compiler02/7637/lvs1003.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930) (owner: Filippo Giunchedi)
[10:19:44] <elukey>	 ah maybe it was due to the last reimage, doesn't count much
[10:20:30] <wikibugs_>	 (CR) Ema: [C: 1] hieradata: bump ProxyFetch timeout for thumbor [puppet] - https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930) (owner: Filippo Giunchedi)
[10:22:12] <elukey>	 it was 4.9.25-1~bpo8+3 when we sent the sos report, just checked
[10:22:37] <wikibugs_>	 (PS2) Filippo Giunchedi: hieradata: bump ProxyFetch timeout for thumbor [puppet] - https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930)
[10:23:32] <wikibugs_>	 (CR) Filippo Giunchedi: [C: 2] hieradata: bump ProxyFetch timeout for thumbor [puppet] - https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930) (owner: Filippo Giunchedi)
[10:24:50] <godog>	 _joe_: merging your change too
[10:24:59] <_joe_>	 ouch sorry
[10:25:01] <_joe_>	 thanks
[10:26:46] <godog>	 np!
[10:28:16] <wikibugs_>	 (PS4) ArielGlenn: add user and directory setup to dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/374242 (https://phabricator.wikimedia.org/T169849)
[10:29:00] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1169 is OK: OK
[10:29:10] <moritzm>	 elukey: that's because you reimaged it and it got installed with the latest kernel
[10:29:37] <moritzm>	 all fine, the 4.9.30 update hasn't been rolled out fleet-wide
[10:30:01] <elukey>	 yep yep 
[10:30:35] <elukey>	 I checked the sosreport and 4.9.25-1~bpo8+3 was installed, so my pebkac
[10:30:39] <elukey>	 :)
[10:31:33] <moritzm>	 !log reimaging  mw1168 (video scaler) to jessie
[10:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:00] <Lucas_WMDE>	 hi, is someone online whom I can ask a few questions about the object cache?
[10:32:41] <wikibugs_>	 (CR) ArielGlenn: [C: 2] add user and directory setup to dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/374242 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[10:36:28] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/374521 (https://phabricator.wikimedia.org/T161088)
[10:36:59] <icinga-wm>	 PROBLEM - puppet last run on dumpsdata1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/data/xmldatadumps/public/wikidatawiki/entities]
[10:39:21] <wikibugs_>	 (PS3) Matthias Mullie: Add missing THREED2PNG_PATH [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/373595 (https://phabricator.wikimedia.org/T161719)
[10:43:35] <wikibugs_>	 (PS2) Zfilipin: Add several HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm)
[10:45:30] <wikibugs_>	 (PS1) ArielGlenn: manually create wikidatawiki dumps dir on dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/374522
[10:45:44] <apergos>	 ignore the dumpsdata whine, fixing
[10:47:40] <wikibugs_>	 Operations, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3561542 (Urbanecm) p:Triage>Low
[10:47:43] <wikibugs_>	 (PS2) ArielGlenn: manually create wikidatawiki dumps dir on dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/374522
[10:48:00] <moritzm>	 !log installing libxml2 security updates
[10:48:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:37] <wikibugs_>	 Operations, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3559677 (Urbanecm) Will create config.
[10:49:34] <wikibugs_>	 (CR) ArielGlenn: [C: 2] manually create wikidatawiki dumps dir on dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/374522 (owner: ArielGlenn)
[10:51:12] <icinga-wm>	 RECOVERY - puppet last run on dumpsdata1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[10:52:46] <wikibugs_>	 (PS4) Hashar: apt:pin pref file must not have space [puppet] - https://gerrit.wikimedia.org/r/353540
[10:52:48] <wikibugs_>	 (PS1) Hashar: apt: spec boiler plate [puppet] - https://gerrit.wikimedia.org/r/374527
[10:53:20] <wikibugs_>	 (CR) Hashar: "I have added the boiler plate for a follow up change https://gerrit.wikimedia.org/r/#/c/353540/" [puppet] - https://gerrit.wikimedia.org/r/374527 (owner: Hashar)
[10:53:35] <wikibugs_>	 (CR) Hashar: "Rebased and amended to add a basic spec test." [puppet] - https://gerrit.wikimedia.org/r/353540 (owner: Hashar)
[10:54:21] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/374521 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[10:55:49] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/374521 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[10:55:53] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2065605
[10:56:17] <wikibugs_>	 (CR) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/374521 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[10:57:31] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 - T161088 (duration: 00m 43s)
[10:57:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:44] <stashbot>	 T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088
[10:58:32] <wikibugs_>	 (PS1) Urbanecm: Initial configuration for electcomwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370)
[10:59:57] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Initial configuration for electcomwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: Urbanecm)
[11:02:02] <wikibugs_>	 (PS1) Jcrespo: mariadb: Decomission db1033 and db1028 [puppet] - https://gerrit.wikimedia.org/r/374529 (https://phabricator.wikimedia.org/T174076)
[11:03:23] <wikibugs_>	 (PS1) Jcrespo: mariadb: Decommission db1033 and db1028 [software] - https://gerrit.wikimedia.org/r/374530 (https://phabricator.wikimedia.org/T174076)
[11:04:17] <wikibugs_>	 (PS2) Jcrespo: mariadb: Decommission db1033 and db1028 [puppet] - https://gerrit.wikimedia.org/r/374529 (https://phabricator.wikimedia.org/T174076)
[11:05:36] <wikibugs_>	 (PS2) Urbanecm: Initial configuration for electcomwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370)
[11:08:13] <marostegui>	 !log Stop MariaDB on db1097 to migrate it to file per table - T161088
[11:08:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:26] <stashbot>	 T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088
[11:12:55] <wikibugs_>	 (CR) Jcrespo: [C: 2] mariadb: Decommission db1033 and db1028 [software] - https://gerrit.wikimedia.org/r/374530 (https://phabricator.wikimedia.org/T174076) (owner: Jcrespo)
[11:13:07] <wikibugs_>	 (CR) Jcrespo: [C: 2] mariadb: Decommission db1033 and db1028 [puppet] - https://gerrit.wikimedia.org/r/374529 (https://phabricator.wikimedia.org/T174076) (owner: Jcrespo)
[11:15:11] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1168 is CRITICAL: Host mw1168 is not in mediawiki-installation dsh group
[11:15:21] <icinga-wm>	 PROBLEM - DPKG on kubernetes1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:15:21] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1168 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:16:10] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group
[11:16:20] <icinga-wm>	 PROBLEM - Disk space on kubernetes1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:16:20] <icinga-wm>	 PROBLEM - Check systemd state on mw1168 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:16:20] <icinga-wm>	 PROBLEM - nutcracker port on mw1168 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:01] <icinga-wm>	 PROBLEM - nutcracker process on mw1168 is CRITICAL: Return code of 255 is out of bounds
[11:17:01] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on mw1168 is CRITICAL: Return code of 255 is out of bounds
[11:17:20] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1168 is OK: OK: nf_conntrack is 0 % full
[11:17:51] <icinga-wm>	 PROBLEM - MD RAID on kubernetes1004 is CRITICAL: Return code of 255 is out of bounds
[11:19:00] <icinga-wm>	 RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[11:19:07] <akosiaris>	 kubernetes1004 is known
[11:19:11] <icinga-wm>	 RECOVERY - Disk space on kubernetes1004 is OK: DISK OK
[11:19:16] <akosiaris>	 damn icinga was faster than the reimage
[11:19:20] <icinga-wm>	 RECOVERY - DPKG on kubernetes1004 is OK: All packages OK
[11:19:22] <moritzm>	 ^ mw1168 is a reimage, silencing it
[11:21:30] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 2 minutes ago with 5 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX],Logical_volume[data]
[11:22:20] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:25:26] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational
[11:25:47] <wikibugs_>	 (PS1) Jcrespo: mariadb: Remove db1028 and db1033 from hiera [puppet] - https://gerrit.wikimedia.org/r/374531 (https://phabricator.wikimedia.org/T174076)
[11:26:17] <wikibugs_>	 (CR) Jcrespo: [C: 2] mariadb: Remove db1028 and db1033 from hiera [puppet] - https://gerrit.wikimedia.org/r/374531 (https://phabricator.wikimedia.org/T174076) (owner: Jcrespo)
[11:26:36] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[11:27:11] <wikibugs_>	 Operations, Traffic, Community-Liaisons (Jul-Sep 2017), Patch-For-Review, User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3561653 (Johan) Thanks for reporting. We've looked into...
[11:28:26] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:29:16] <icinga-wm>	 RECOVERY - nutcracker process on mw1168 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker
[11:29:26] <icinga-wm>	 RECOVERY - nutcracker port on mw1168 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[11:29:26] <icinga-wm>	 RECOVERY - Check systemd state on mw1168 is OK: OK - running: The system is fully operational
[11:29:42] <wikibugs_>	 Operations, ops-esams, DNS, Traffic, netops: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#3561654 (faidon) Open>Resolved
[11:30:02] <wikibugs_>	 Operations, ops-eqiad, DBA: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3561656 (jcrespo)
[11:30:50] <elukey>	 !log restart java daemons on analytics100[1,2] (Hadoop Master nodes) for jvm updates
[11:31:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:48] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1168.eqiad.wmnet
[11:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:55] <wikibugs_>	 Operations, Beta-Cluster-Infrastructure, HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3561669 (MoritzMuehlenhoff)
[11:44:57] <wikibugs_>	 Operations, Operations-Software-Development, HHVM, Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#3561670 (MoritzMuehlenhoff)
[11:45:01] <wikibugs_>	 Operations, Multimedia, TimedMediaHandler, HHVM, Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#3561667 (MoritzMuehlenhoff) Open>Resolved Migration to jessie is completed.
[11:45:16] <wikibugs_>	 Operations, Operations-Software-Development, HHVM, Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2571031 (MoritzMuehlenhoff)
[11:45:29] <wikibugs_>	 Operations, Operations-Software-Development, HHVM, Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2571031 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff Migration to jessie is complete
[11:47:03] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on mw1168 is OK: OK: synced at Tue 2017-08-29 11:47:01 UTC.
[11:47:06] <elukey>	 moritzm: great work --^ 
[11:47:11] <paravoid>	 yup!
[11:47:15] <wikibugs_>	 Operations, Beta-Cluster-Infrastructure, HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3561690 (MoritzMuehlenhoff) All the packaging work for jessie is complete (and the servers in production have been migrated). If deployment-tmh01 is still used it can be re...
[11:48:18] <wikibugs_>	 Operations, DBA, Scoring-platform-team, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3561691 (Marostegui) How do you guys want to proceed with this in the end? Is it worth the risk?
[11:49:27] <elukey>	 !log restart kafka daemons on kafka1013 for jvm security updates
[11:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:13] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[11:55:32] <icinga-wm>	 PROBLEM - HHVM rendering on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[11:55:32] <icinga-wm>	 PROBLEM - Apache HTTP on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[11:55:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[11:56:02] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[11:56:13] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.082 second response time
[11:56:32] <icinga-wm>	 RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 74335 bytes in 0.539 second response time
[11:56:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.036 second response time
[11:56:52] <icinga-wm>	 RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.458 second response time
[11:57:03] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.072 second response time
[12:00:42] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational
[12:03:42] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:15:13] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1168 is OK: OK
[12:15:23] <wikibugs_>	 (PS1) Elukey: role::analytics::hadoop::master: fix descriptions of HDFS alarms [puppet] - https://gerrit.wikimedia.org/r/374536
[12:15:56] <wikibugs_>	 (CR) Elukey: [C: 2] role::analytics::hadoop::master: fix descriptions of HDFS alarms [puppet] - https://gerrit.wikimedia.org/r/374536 (owner: Elukey)
[12:25:20] <wikibugs_>	 Operations, HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3561778 (MoritzMuehlenhoff)
[12:25:52] <wikibugs_>	 Operations, OCG-General, Reading-Community-Engagement, Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3561794 (ovasileva)
[12:28:59] <wikibugs_>	 Operations, OfflineContentGenerator, Reading-Community-Engagement, Patch-For-Review, and 2 others: Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf - https://phabricator.wikimedia.org/T150874#3561803 (ovasileva)
[12:31:26] <godog>	 !log bounce pybal on lvs1003 to pick up config changes
[12:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:34] <wikibugs_>	 Operations, Traffic: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3561810 (fgiunchedi)
[12:37:40] <elukey>	 !log restart kafka daemons on kafka1014 for jvm security updates
[12:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:34] <wikibugs_>	 (PS3) Giuseppe Lavagetto: ganglia::gmetad::rrdcached: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374353 (https://phabricator.wikimedia.org/T171704)
[12:40:36] <wikibugs_>	 (PS1) Giuseppe Lavagetto: statistics::wmde: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374538 (https://phabricator.wikimedia.org/T171704)
[12:43:18] <wikibugs_>	 (PS4) Giuseppe Lavagetto: ganglia::gmetad::rrdcached: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374353 (https://phabricator.wikimedia.org/T171704)
[12:44:08] <wikibugs_>	 (CR) Giuseppe Lavagetto: [C: 2] ganglia::gmetad::rrdcached: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374353 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[12:46:42] <wikibugs_>	 (PS2) Giuseppe Lavagetto: statistics::wmde: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374538 (https://phabricator.wikimedia.org/T171704)
[12:55:14] <wikibugs_>	 (CR) Giuseppe Lavagetto: [C: 2] statistics::wmde: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374538 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[12:59:59] <zeljkof>	 jouncebot: next
[13:00:00] <jouncebot>	 In 0 hour(s) and 0 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1300)
[13:00:04] <jouncebot>	 addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1300). Please do the needful.
[13:00:04] <jouncebot>	 Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:11] <Urbanecm>	 I'm here
[13:00:18] <zeljkof>	 I can SWAT today!
[13:00:24] <wikibugs_>	 (PS4) Ottomata: webperf: Convert navtiming.py to use KafkaConsumer [puppet] - https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle)
[13:00:26] <wikibugs_>	 (PS6) Ottomata: webperf: Add unit tests for schema handlers and stat dispatching [puppet] - https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[13:00:58] <zeljkof>	 Urbanecm: ok, looks like there is only one patch, will merge and deploy
[13:01:03] <Urbanecm>	 Great
[13:01:07] <zeljkof>	 I'll ping you when it's at mwdebug
[13:03:21] <wikibugs_>	 (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm)
[13:04:53] <wikibugs_>	 (Merged) jenkins-bot: Add several HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm)
[13:06:00] <wikibugs_>	 (CR) jenkins-bot: Add several HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm)
[13:06:25] <wikibugs_>	 (CR) Ottomata: [C: 2] webperf: Add unit tests for schema handlers and stat dispatching [puppet] - https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[13:06:35] <zeljkof>	 Urbanecm: the commit is at mwdebug1002
[13:07:38] <Urbanecm>	 ack
[13:08:13] <Urbanecm>	 zeljkof, please deploy
[13:08:30] <zeljkof>	 Urbanecm: deploying
[13:08:37] <Urbanecm>	 thx
[13:09:09] <logmsgbot>	 !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:374071|Add several HD logos (T150618)]] (duration: 00m 43s)
[13:09:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:22] <stashbot>	 T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618
[13:10:17] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:374071|Add several HD logos (T150618)]] (duration: 00m 43s)
[13:10:23] <zeljkof>	 Urbanecm: deployed
[13:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:37] <Urbanecm>	 thank you
[13:12:33] <zeljkof>	 looks like that is all
[13:12:37] <zeljkof>	 !log EU SWAT finished
[13:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:10] <wikibugs_>	 Operations, Discovery, Discovery-Analysis, Maps, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3561943 (Gehel) Deployment is scheduled for Thursday August 31.
[13:17:02] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response time
[13:17:42] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[13:18:01] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.043 second response time
[13:18:02] <icinga-wm>	 PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[13:18:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[13:19:02] <icinga-wm>	 RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 74416 bytes in 0.191 second response time
[13:19:11] <wikibugs_>	 (PS1) Giuseppe Lavagetto: zuul: remove configfile define [puppet] - https://gerrit.wikimedia.org/r/374541 (https://phabricator.wikimedia.org/T171704)
[13:19:21] <icinga-wm>	 RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.072 second response time
[13:19:40] <elukey>	 weird
[13:19:42] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.090 second response time
[13:19:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1282 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time
[13:20:24] <wikibugs_>	 (PS5) Ottomata: webperf: Convert navtiming.py to use KafkaConsumer [puppet] - https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle)
[13:20:49] <wikibugs_>	 (CR) jerkins-bot: [V: -1] webperf: Convert navtiming.py to use KafkaConsumer [puppet] - https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle)
[13:20:51] <icinga-wm>	 RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.064 second response time
[13:21:25] <elukey>	 Aug 29 13:18:48 mw1276 systemd[1]: hhvm.service: main process exited, code=killed, status=11/SEGV
[13:21:31] <elukey>	 no bueno
[13:21:43] <volans>	 SWAT was completed not long ago, could be related?
[13:22:14] <wikibugs_>	 (PS6) Ottomata: webperf: Convert navtiming.py to use KafkaConsumer [puppet] - https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle)
[13:22:32] <wikibugs_>	 Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3481107 (Krenair) Was the maintain-views step not performed? ```MariaDB [hiwikiversity_p]> show tables; Empty set (0.00 sec)```
[13:24:57] <wikibugs_>	 (CR) Ottomata: [C: 2] webperf: Convert navtiming.py to use KafkaConsumer [puppet] - https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle)
[13:25:25] <wikibugs_>	 (CR) Ottomata: [C: 2] "I refactored a little to use hiera lookups from the role class, rather than the modules, and fixed up some parameters that I think would h" [puppet] - https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle)
[13:25:38] <wikibugs_>	 (PS7) Ottomata: webperf: Add unit tests for schema handlers and stat dispatching [puppet] - https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[13:25:40] <wikibugs_>	 (CR) Ottomata: [V: 2 C: 2] webperf: Add unit tests for schema handlers and stat dispatching [puppet] - https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[13:26:10] <elukey>	 volans: from the stacktrace it seems something related to AbuseFilter, but I have no idea if it could be related
[13:26:11] <icinga-wm>	 PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[13:26:33] <icinga-wm>	 PROBLEM - Apache HTTP on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[13:26:57] <elukey>	 zeljkof: hi :)
[13:27:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 74416 bytes in 0.816 second response time
[13:27:25] <wikibugs_>	 (PS2) Giuseppe Lavagetto: zuul: remove configfile define [puppet] - https://gerrit.wikimedia.org/r/374541 (https://phabricator.wikimedia.org/T171704)
[13:27:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.088 second response time
[13:27:47] <elukey>	 same stacktrace for --^
[13:28:00] <zeljkof>	 elukey: hi
[13:28:15] <wikibugs_>	 Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561980 (Reedy) >>! In T171829#3553943, @Marostegui wrote: > The blocker is fixed and so is this one too: > ``` > mysql:root@localhost [hiwikiversity_p]> show t...
[13:29:07] <volans>	 seems that the only merged patch was https://gerrit.wikimedia.org/r/#/c/374071/ in the last SWAT
[13:29:21] <_joe_>	 elukey: did you check the caches sizes?
[13:29:25] <elukey>	 zeljkof: after the deployment there seems to be some api hosts reporting segfaults for HHVM and PHP stacktraces containing AbuseFilter
[13:29:31] <elukey>	 _joe_ nope 
[13:29:39] <_joe_>	 1 sec
[13:29:58] <wikibugs_>	 Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561986 (Marostegui) >>! In T168765#3561969, @Krenair wrote: > Was the maintain-views step not completely performed? > ```MariaDB [hiwikiversity_p]> show tables...
[13:30:00] <_joe_>	 zeljkof: please prepare the revert for any change that could be related to Abusefilter
[13:30:04] <_joe_>	 but do not commit it
[13:30:16] <volans>	 zeljkof: is it normal that the keys are repeated?
[13:30:31] <Krenair>	 ty marostegui 
[13:30:44] <Krenair>	 I was indeed looking at 1003
[13:31:01] <marostegui>	 it is fixed now :)
[13:31:04] <wikibugs_>	 Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561990 (Marostegui) >>! In T168765#3561986, @Marostegui wrote: >>>! In T168765#3561969, @Krenair wrote: >> Was the maintain-views step not completely performed...
[13:31:46] <volans>	 many of the lines added in that patch were already there
[13:31:58] <_joe_>	 wat?
[13:32:03] <_joe_>	 let's revert, yes?
[13:32:06] <volans>	 https://gerrit.wikimedia.org/r/#/c/374071/3/wmf-config/InitialiseSettings.php
[13:33:04] <Krenair>	 works, ty again
[13:33:14] <_joe_>	 yeah don't think that's relevant
[13:33:21] <zeljkof>	 volans, _joe_: oops, I have deployed only changes to logos, did something break?
[13:33:38] <_joe_>	 I think it's unrelated
[13:33:44] <_joe_>	 but let me dig a bit deeper
[13:33:49] <Reedy>	 Someone probably wrote a shitty rule and/or enabled it
[13:33:55] <Reedy>	 Got the stack trace anywhere?
[13:34:08] <volans>	 it's just duplicated keys in the array, that sounds wrong but unrelated
[13:34:16] <_joe_>	 Reedy: I guess on mw1279 you could have one under /var/core
[13:34:17] <elukey>	 Reedy: yep, it is in /var/log/hhvm/ (last stacktrace.etc..)
[13:34:24] <zeljkof>	 I did not even notice duplicate keys :( my mistake
[13:34:27] <_joe_>	 in /var/log/hhvm, yeah
[13:34:32] <_joe_>	 after it has restarted
[13:34:38] <zeljkof>	 I'm around, let me know if I need to do anything
[13:34:43] <_joe_>	 but yeah zeljkof that's surely unrelated
[13:34:56] <Reedy>	 Almost certainly
[13:35:03] <Reedy>	 I think PHP doesn't even complain about it
[13:35:11] <volans>	 probably just override them
[13:35:19] <Reedy>	 IIRC, it's the last one that wins
[13:35:20] <Reedy>	 elukey: On what host?
[13:35:26] <volans>	 mw1279
[13:36:13] <Reedy>	 That looks like an array of usernames
[13:36:13] <elukey>	 one of the last failed ones, mw1279 should be good
[13:36:58] <Reedy>	 slow query: SELECT /* GenderCache::doQuery/ApiQueryAllPages::run  */  user_name,up_value  FROM `user` LEFT JOIN `user_properties` ON ((user_id = up_user) AND up_property = 'gender')   WHERE user_name IN ('
[13:37:39] <Reedy>	 oh, stacktrace
[13:38:20] <wikibugs_>	 (PS1) Faidon Liambotis: Add sandbox1-esams and ripe-atlas-esams [dns] - https://gerrit.wikimedia.org/r/374545
[13:38:31] <icinga-wm>	 PROBLEM - Apache HTTP on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[13:38:45] <_joe_>	 it's happening again
[13:38:58] <wikibugs_>	 Operations, ops-codfw, Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3561996 (Papaul) @elukey i think i will take your advice to burn mw2256 down lol. Here is want I want for you to do for me. Configure the system to generate a kernel crash dump. When the syste...
[13:39:01] <icinga-wm>	 PROBLEM - HHVM rendering on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[13:39:11] <wikibugs_>	 Operations, Ops-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T174316#3561997 (Ottomata) @robh, can you advise here?  What needs to be done for a LDAP request for someone who has signed an NDA?  Is there a place where I can look up the signed NDA to verify?  If I do, can I j...
[13:39:12] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[13:39:15] <volans>	 could be that this list of usernames is too long and makes hhvm crash?
[13:39:26] <_joe_>	 which list?
[13:39:27] <volans>	 and the user is re-trying and at every retry one HHVM dies?
[13:39:31] <volans>	 see error.log
[13:39:44] <Reedy>	 hhvm shouldn't be dying from a slow mysql query...
[13:39:45] <wikibugs_>	 (CR) Faidon Liambotis: [C: 2] Add sandbox1-esams and ripe-atlas-esams [dns] - https://gerrit.wikimedia.org/r/374545 (owner: Faidon Liambotis)
[13:39:55] <volans>	 Reedy: not for the slowness, but for the size of it
[13:40:02] <icinga-wm>	 RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 74418 bytes in 1.871 second response time
[13:40:11] <volans>	 my wild guess
[13:40:12] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.125 second response time
[13:40:31] <icinga-wm>	 RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.021 second response time
[13:41:43] <Reedy>	 AbuseFilter doesn't use the API recent changes... So I'd suggest the two are unrelated
[13:41:44] <paravoid>	 jenkins having issues?
[13:42:15] <volans>	 paravoid: not that I know, why?
[13:42:23] <paravoid>	 nah, just took a while
[13:45:30] <wikibugs_>	 Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3562003 (Papaul) @madhuvishy here is what I am about to setup   on Labstore2001   3xRAID10 of 8 disks per logical/virtua...
[13:50:50] <wikibugs_>	 (PS3) Giuseppe Lavagetto: zuul: remove configfile define [puppet] - https://gerrit.wikimedia.org/r/374541 (https://phabricator.wikimedia.org/T171704)
[13:51:42] <wikibugs_>	 (CR) Giuseppe Lavagetto: [C: 2] zuul: remove configfile define [puppet] - https://gerrit.wikimedia.org/r/374541 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[13:58:58] <wikibugs_>	 (CR) Reedy: "As reported on IRC, a followup of this needs making that fixes the duplicate array keys" [mediawiki-config] - https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm)
[13:59:03] <wikibugs_>	 (PS1) Faidon Liambotis: Add mgmt IPs for cr3-esams, new MX480 [dns] - https://gerrit.wikimedia.org/r/374548
[14:00:09] <wikibugs_>	 (CR) Faidon Liambotis: [C: 2] Add mgmt IPs for cr3-esams, new MX480 [dns] - https://gerrit.wikimedia.org/r/374548 (owner: Faidon Liambotis)
[14:06:10] <wikibugs_>	 (PS5) Gehel: elasticsearch - switch to using logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373509
[14:08:29] <wikibugs_>	 (PS6) Gehel: elasticsearch - switch to using logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373509
[14:09:16] <wikibugs_>	 (CR) Gehel: [C: 2] elasticsearch - switch to using logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373509 (owner: Gehel)
[14:12:58] <wikibugs_>	 (PS3) Gehel: apertium - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373510
[14:13:07] <wikibugs_>	 (PS4) Gehel: apertium - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373510
[14:13:12] <wikibugs_>	 (CR) jerkins-bot: [V: -1] apertium - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373510 (owner: Gehel)
[14:15:46] <wikibugs_>	 (PS3) Gehel: base - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373515
[14:16:00] <wikibugs_>	 (CR) jerkins-bot: [V: -1] base - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373515 (owner: Gehel)
[14:16:12] <wikibugs_>	 (CR) Gehel: apertium - switch to logrotate::rule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/373510 (owner: Gehel)
[14:16:24] <wikibugs_>	 (PS4) Gehel: base - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373515
[14:17:01] <wikibugs_>	 (CR) Gehel: "The convention in puppet is to use snake case, I'd prefer to follow it..." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/373515 (owner: Gehel)
[14:18:36] <wikibugs_>	 (PS3) Gehel: camus - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373516
[14:18:45] <wikibugs_>	 (PS4) Gehel: camus - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373516
[14:23:56] <wikibugs_>	 (CR) Volans: [C: 1] "LGTM, being the base module please double check it with the puppet compiler to be sure ;)" [puppet] - https://gerrit.wikimedia.org/r/373515 (owner: Gehel)
[14:24:13] <wikibugs_>	 Operations, ops-eqiad, netops, Patch-For-Review: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3562108 (Cmjohnson) moved the fibers to cr1/2 to xe-3/1/7 swapped the cable with fiber connecting pfw-3a and pfw-3b (xe-1/0/17) Moved connections pfw-3a/b and fasw-c1a/...
[14:32:33] <wikibugs_>	 Operations, OCG-General, Reading-Community-Engagement, Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3562138 (ovasileva)
[14:33:28] <marostegui>	 !log Shutdown db1055 to replace its BBU - T174265
[14:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:43] <stashbot>	 T174265: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265
[14:33:45] <wikibugs_>	 Operations, OCG-General, Reading-Community-Engagement, Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2799650 (ovasileva) @GWicke - timeline now updated in task description.  OCG switching will be done by the en...
[14:35:40] <wikibugs_>	 Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3562150 (Cmjohnson) HP requested the AHS log, uploaded the log to their system. Waiting on their response.  Only working with 1006 at the moment.
[14:47:31] <icinga-wm>	 PROBLEM - Host db1055.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:48:13] <marostegui>	 ^ that is expected
[14:51:18] <elukey>	 !log drop log.MobileWebUIClickTracking_10742159_15423246 from db1047 (archived on HDFS) - T172322
[14:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:31] <stashbot>	 T172322: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322
[14:52:21] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2074714
[14:52:41] <icinga-wm>	 RECOVERY - Host db1055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms
[14:54:31] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2027261
[14:56:01] <wikibugs_>	 Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3483663 (chasemp) >>! In T168765#3561990, @Marostegui wrote: >>>! In T168765#3561986, @Marostegui wrote: >>>>! In T168765#3561969, @Krenair wrote: >>> Was the m...
[14:56:32] <wikibugs_>	 (PS3) Filippo Giunchedi: install_server: add partman for cassandra JBOD [puppet] - https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939)
[15:03:03] <icinga-wm>	 PROBLEM - Host cp3035.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:03:03] <icinga-wm>	 PROBLEM - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:03:24] <wikibugs_>	 Operations, ops-eqiad, DBA: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265#3562203 (Marostegui) The BBU has been replaced and looks good: ``` root@db1055:/home/marostegui# megacli -AdpBbuCmd  -a0  BBU status for Adapter: 0  BatteryType: BBU Voltage: 3918...
[15:03:52] <icinga-wm>	 PROBLEM - Host cp3033.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:06:28] <godog>	 mark paravoid ^ 3x cp down, expected?
[15:06:33] <godog>	 esams cp that is
[15:06:52] <icinga-wm>	 PROBLEM - Host cp3039.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:06:56] <paravoid>	 mark is on-site, but that's not expected no
[15:07:00] <mark>	 it is
[15:07:15] <paravoid>	 these are in production, I think?
[15:07:23] <mark>	 it's mgmt
[15:07:27] <paravoid>	 oh duh
[15:07:32] <mark>	 i unplugged one link to remove a switch
[15:07:37] <godog>	 doh, my bad
[15:07:38] <paravoid>	 duh sorry
[15:07:44] <mark>	 nice that we can see that these days ;)
[15:08:11] <icinga-wm>	 RECOVERY - Host cp3035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.72 ms
[15:08:12] <icinga-wm>	 RECOVERY - Host cp3034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.75 ms
[15:08:27] <godog>	 I have to train eyes to read the full hostname now heh
[15:09:01] <icinga-wm>	 RECOVERY - Host cp3033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.66 ms
[15:12:02] <icinga-wm>	 RECOVERY - Host cp3039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.41 ms
[15:12:29] <wikibugs_>	 Operations, Ops-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T174316#3562234 (Samtar) If at all helpful, the NDA sent through "cobblestone" was named "WMF-Sam Tarling Volunteer NDA January 2017" and has a reference number `832b7db5-a494-43ac-8895-c3ba6837e18d`
[15:14:57] <wikibugs_>	 (PS1) Giuseppe Lavagetto: puppetmaster::passenger: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374553 (https://phabricator.wikimedia.org/T171704)
[15:14:57] <wikibugs_>	 (PS1) Giuseppe Lavagetto: labspuppetmaster: fix array interpolation in strings [puppet] - https://gerrit.wikimedia.org/r/374554 (https://phabricator.wikimedia.org/T171704)
[15:17:11] <wikibugs_>	 Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3562267 (madhuvishy) @Papaul Yup that's perfect, thanks!
[15:20:29] <wikibugs_>	 Operations, LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T174316#3562274 (RobH) Open>Resolved a:RobH So Ldap requests are not the same as ops requests, a few things has to happen:  * swapping this from #ops-access-requests to #ldap-access-requests - done *...
[15:20:49] <wikibugs_>	 Operations, LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T174316#3562278 (RobH) a:RobH>None
[15:24:14] <wikibugs_>	 (CR) Eevans: "> > So this is ~50G for the data raid-1?  If so, that seems to be" [puppet] - https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: Filippo Giunchedi)
[15:31:37] <wikibugs_>	 (PS19) Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: MarkTraceur)
[15:32:37] <wikibugs_>	 (PS1) Alexandros Kosiaris: kubelet: Remove configure-cbr0 parameter [puppet] - https://gerrit.wikimedia.org/r/374556 (https://phabricator.wikimedia.org/T170119)
[15:33:42] <wikibugs_>	 (PS2) Alexandros Kosiaris: kubelet: Remove configure-cbr0 parameter [puppet] - https://gerrit.wikimedia.org/r/374556 (https://phabricator.wikimedia.org/T170119)
[15:35:24] <wikibugs_>	 (CR) Alexandros Kosiaris: "Added WMCS people so they are aware this is going out. I expect it to not cause any problems as it is a tested noop in toollabs. I am prob" [puppet] - https://gerrit.wikimedia.org/r/374556 (https://phabricator.wikimedia.org/T170119) (owner: Alexandros Kosiaris)
[15:48:03] <wikibugs_>	 Operations, ops-eqiad, DC-Ops, netops: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3562348 (ayounsi)
[15:48:13] <wikibugs_>	 Operations, ops-eqiad, DC-Ops, netops: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3554735 (ayounsi)
[15:49:31] <wikibugs_>	 (CR) Hashar: "Booted a new instance and it works:" [puppet] - https://gerrit.wikimedia.org/r/369873 (owner: Hashar)
[15:54:07] <urandom>	 !log T169939: Decommission Cassandra: restbase2005-c.codfw.wmnet
[15:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:19] <stashbot>	 T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[15:54:31] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 422
[15:59:45] <wikibugs_>	 Operations, ops-eqiad, Release-Engineering-Team: tin has a failing hdd - https://phabricator.wikimedia.org/T174449#3562401 (RobH)
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1600).
[16:00:58] <elukey>	 seems no patches scheduled
[16:08:21] <wikibugs_>	 (PS1) Jcrespo: mariadb: Implement regular logical backups using mydumper [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516)
[16:08:46] <wikibugs_>	 (CR) jerkins-bot: [V: -1] mariadb: Implement regular logical backups using mydumper [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[16:12:39] <wikibugs_>	 Operations, DBA, Scoring-platform-team, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562459 (chasemp) a:madhuvishy
[16:13:11] <wikibugs_>	 Operations, Release-Engineering-Team, hardware-requests: eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3562461 (RobH)
[16:14:06] <wikibugs_>	 (PS2) Jcrespo: mariadb: Implement regular logical backups using mydumper [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516)
[16:14:48] <wikibugs_>	 (Abandoned) Jcrespo: [WIP]mariadb: First attempt at a mydumper-based dump script [puppet] - https://gerrit.wikimedia.org/r/371944 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[16:16:23] <wikibugs_>	 (CR) Jcrespo: "The script is mostly empty, but this would be a skeleton of the puppet code, with minimal run of mydumper (but not much functionality)." [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[16:28:42] <icinga-wm>	 PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:35:01] <wikibugs_>	 Operations, DBA, Scoring-platform-team, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562523 (madhuvishy) @Marostegui We talked about this today in our meeting, and think that since we don't have significant user traffic moved over from 1001...
[16:38:09] <wikibugs_>	 Operations, DBA, Scoring-platform-team, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562543 (bd808) I vote we close this as "resolved" with a note that 1001/3 have not been rebooted because of the fear of catastrophic hardware failure and t...
[16:38:15] <davidwbarratt>	 bd808 so... do you know how I would get access to non _p databases? https://phabricator.wikimedia.org/T170717#3560460
[16:39:11] <icinga-wm>	 RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[16:39:13] <bd808>	 davidwbarratt: what db server are you trying to do this on?
[16:39:39] <davidwbarratt>	 bd808 db1095
[16:41:52] <icinga-wm>	 PROBLEM - Host labstore2002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:42:44] <bd808>	 hmmm.... db1095 is a sanatarium host. I don't think that's where you should be going for raw data.
[16:43:17] <bd808>	 Just run `sql enwiki` from terbium and you should be connected to a full replica server
[16:43:38] <davidwbarratt>	 bd808 ah! thanks!
[16:44:09] <davidwbarratt>	 bd808 and that worked! thanks!
[16:44:20] <bd808>	 yw!
[16:45:08] <bd808>	 the sanatarium boxes are special. they filter out the private data that can't go into the cloud services replicas
[16:48:11] <davidwbarratt>	 bd808 ah. got it. thanks!
[16:51:46] <wikibugs_>	 Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3562604 (Papaul) @madhuvishy done
[16:53:10] <wikibugs_>	 (CR) Smalyshev: [C: 1] wdqs - logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710) (owner: Gehel)
[16:53:21] <icinga-wm>	 RECOVERY - Host labstore2002 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[16:54:12] <wikibugs_>	 Operations, DBA, Scoring-platform-team, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562639 (Marostegui) >>! In T168584#3562523, @madhuvishy wrote: > @Marostegui We talked about this today in our meeting, and think that since we don't have...
[16:57:27] <wikibugs_>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562647 (RobH)
[16:57:56] <wikibugs_>	 Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3562695 (EBernhardson) >>! In T172710#3561315, @Gehel wrote: > Logs are now sent to logstash, but the "host" field isn't set correctly (its value i...
[16:59:56] <wikibugs_>	 Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3562704 (Gehel) Oh, I was expecting %{HOSTNAME} to be interpreted by logstash itself, not as a ref in the same document. There is something about H...
[17:00:05] <jouncebot>	 gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1700).
[17:00:17] <awight>	 Nothing for ORES!
[17:10:07] <wikibugs_>	 Operations, Release-Engineering-Team, hardware-requests: eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3562755 (RobH) Please note that there has been some IRC discussion.  The relevance of moving deployment to ganeti was discussed on T144578.  Additionally, @bd8...
[17:12:09] <wikibugs_>	 (CR) GWicke: "I am a bit concerned about doubling the write bandwidth consumed by journal writes. IIRC Cassandra journals contain full data, so this is " [puppet] - https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: Filippo Giunchedi)
[17:13:47] <wikibugs_>	 Operations, Release-Engineering-Team, hardware-requests: eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3562461 (demon) >>! In T174452#3562755, @RobH wrote: > I'd advise we try out the spare system with the SATA disks and see how well it works.  Its a higher perf...
[17:20:00] <Reedy>	 !log Disabled oathauth for KartikMistry on wikitech
[17:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:43] <wikibugs_>	 Operations, Traffic: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3561810 (BBlack) Are the non-icmp graphs somehow LVS-specific?  My past impression of such graphs is that they aren't, and it just happens to be the case that the bulk of the LVS hos...
[17:21:17] <wikibugs_>	 (PS1) BBlack: browsersec: update es translation [puppet] - https://gerrit.wikimedia.org/r/374584
[17:21:19] <wikibugs_>	 (PS1) BBlack: browsersec: update ar translation [puppet] - https://gerrit.wikimedia.org/r/374585
[17:26:21] <icinga-wm>	 PROBLEM - MegaRAID on db1048 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[17:26:52] <icinga-wm>	 PROBLEM - Host mw1228 is DOWN: PING CRITICAL - Packet loss = 100%
[17:27:45] <wikibugs_>	 (PS1) Chad: group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/374587
[17:29:27] <wikibugs_>	 Operations, cloud-services-team: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#3562862 (madhuvishy)
[17:29:31] <wikibugs_>	 Operations, DBA, Scoring-platform-team, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562860 (madhuvishy) Open>Resolved 1001/3 have not been rebooted because of the fear of catastrophic hardware failure and their impending decomm.
[17:29:38] <wikibugs_>	 (CR) Chad: [C: -2] "not 4 now" [mediawiki-config] - https://gerrit.wikimedia.org/r/374587 (owner: Chad)
[17:30:09] <logmsgbot>	 !log demon@tin Started scap: bootstrap wmf.16
[17:30:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:26] <wikibugs_>	 Operations, ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3562868 (Cmjohnson) The disks have been swapped needs re-install
[17:30:38] <wikibugs_>	 Operations, Analytics: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (Ottomata)
[17:32:02] <icinga-wm>	 RECOVERY - Host mw1228 is UP: PING WARNING - Packet loss = 44%, RTA = 0.28 ms
[17:32:32] <wikibugs_>	 (CR) BBlack: [C: 2] browsersec: update es translation [puppet] - https://gerrit.wikimedia.org/r/374584 (owner: BBlack)
[17:33:10] <awight>	 I’d like to run lsof on scb1002 to troubleshoot an ORES issue.  Is there any way to get this sudo permission temporarily, or can someone with root paste me the results?  <— akosiaris 
[17:35:41] <wikibugs_>	 (PS1) Reedy: Add techconductwiki to wgCanonicalServer [mediawiki-config] - https://gerrit.wikimedia.org/r/374589 (https://phabricator.wikimedia.org/T174447)
[17:35:55] <Reedy>	 jouncebot: next
[17:35:55] <jouncebot>	 In 1 hour(s) and 24 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1900)
[17:35:57] <Reedy>	 jouncebot: now
[17:35:57] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1700)
[17:36:05] <wikibugs_>	 (CR) Reedy: [C: 2] Add techconductwiki to wgCanonicalServer [mediawiki-config] - https://gerrit.wikimedia.org/r/374589 (https://phabricator.wikimedia.org/T174447) (owner: Reedy)
[17:36:28] <halfak>	 awight, why lsof on scb1002?
[17:36:37] <halfak>	 We've been running capacity tests on ores100*
[17:37:25] <awight>	 I was trying to get a sense of FD consumption during regular load
[17:37:44] <awight>	 You might be right that looking at the stressed machines would be more interesting, though
[17:37:48] <wikibugs_>	 (Merged) jenkins-bot: Add techconductwiki to wgCanonicalServer [mediawiki-config] - https://gerrit.wikimedia.org/r/374589 (https://phabricator.wikimedia.org/T174447) (owner: Reedy)
[17:37:56] <awight>	 cool—and if we have root, all the better.
[17:37:58] <wikibugs_>	 (CR) jenkins-bot: Add techconductwiki to wgCanonicalServer [mediawiki-config] - https://gerrit.wikimedia.org/r/374589 (https://phabricator.wikimedia.org/T174447) (owner: Reedy)
[17:38:53] <awight>	 halfak: I have no root there.  Mind if I request?
[17:39:11] <halfak>	 +1
[17:39:32] <halfak>	 root might be overkill but we should be able to add it to the wheel stuff. 
[17:40:19] <halfak>	 lol 
[17:40:27] * halfak can't view the sudoers file on scb1002
[17:40:52] <wikibugs_>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562940 (RobH)
[17:45:18] <wikibugs_>	 (PS2) BBlack: browsersec: update ar translation [puppet] - https://gerrit.wikimedia.org/r/374585 (https://phabricator.wikimedia.org/T163251)
[17:45:57] <wikibugs_>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562979 (RobH) So I realized that: d-i     partman-auto/choose_recipe      es  was in the recipe, and isn't needed since it doe...
[17:46:31] <wikibugs_>	 (CR) BBlack: [C: 2] browsersec: update ar translation [puppet] - https://gerrit.wikimedia.org/r/374585 (https://phabricator.wikimedia.org/T163251) (owner: BBlack)
[17:47:25] <wikibugs_>	 (PS1) Awight: Let scoring platform team run "lsof" for diagnostics [puppet] - https://gerrit.wikimedia.org/r/374593 (https://phabricator.wikimedia.org/T174402)
[17:47:39] <awight>	 akosiaris: ^ I would love to have that permission
[17:50:43] <wikibugs_>	 Operations, Traffic, Community-Liaisons (Jul-Sep 2017), Patch-For-Review, User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3563013 (leila) >>! In T163251#3561653, @Johan wrote: >...
[17:52:07] <wikibugs_>	 Operations, Cloud-Services: Register to Wikitech - https://phabricator.wikimedia.org/T174469#3563019 (Vacio)
[17:57:15] <wikibugs_>	 Operations, ORES, Scoring-platform-team-Backlog, Patch-For-Review, User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3563049 (awight) #operations I would like lsof permissions on the ORES boxes, https://gerrit.wiki...
[17:57:18] <wikibugs_>	 Operations, Cloud-Services: Register to Wikitech - https://phabricator.wikimedia.org/T174469#3563019 (madhuvishy) @Vacio Could you please elaborate on what the problem is? Did you try signing up to wikitech and did you run into an error? If so what? You can create a wikitech account at https://wikitech.w...
[17:59:07] <wikibugs_>	 (CR) Hoo man: [C: -1] "-1 for now, until concerns about the link item client widget have been looked into (see T174345#3558741)." [mediawiki-config] - https://gerrit.wikimedia.org/r/374328 (https://phabricator.wikimedia.org/T174345) (owner: Urbanecm)
[17:59:24] <wikibugs_>	 Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3563103 (Eevans)
[18:11:53] <wikibugs_>	 Operations, Dumps-Generation, Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3563181 (madhuvishy) @ArielGlenn Sounds good, I would push towards a larger window of atleast 2 hours - 45 minutes to an hour for 3 rsyncs + some cleanup se...
[18:15:29] <urandom>	 !log T169939: Decommission Cassandra: restbase1008-a.eqiad.wmnet
[18:15:36] <logmsgbot>	 !log demon@tin Finished scap: bootstrap wmf.16 (duration: 45m 27s)
[18:15:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:42] <stashbot>	 T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[18:15:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:48] <wikibugs_>	 Operations, Release-Engineering-Team, Category, Epic, Services (watching): FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453#3563197 (JAufrecht) Adding an extra tag to support Phlogiston reporting experiments.
[18:17:41] <Pchelolo>	 !log depooling restbase1008,1009,1010,2003,2005 while cluster reshaping is going on
[18:17:48] <wikibugs_>	 (PS2) Urbanecm: Restrict merging rights to autoconfirmed users on wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374328 (https://phabricator.wikimedia.org/T174345)
[18:17:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:01] <wikibugs_>	 (PS4) Urbanecm: Enable SandboxLink on cywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054)
[18:19:44] <robh>	 !log attempting firmware upgrade on scs-a8-eqiad
[18:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:57] <wikibugs_>	 (PS1) BBlack: Varnish: reload VCL on error page changes [puppet] - https://gerrit.wikimedia.org/r/374601
[18:21:59] <wikibugs_>	 (PS1) BBlack: browsersec: add fa translation [puppet] - https://gerrit.wikimedia.org/r/374602 (https://phabricator.wikimedia.org/T163251)
[18:25:23] <wikibugs_>	 (CR) BBlack: [C: 2] Varnish: reload VCL on error page changes [puppet] - https://gerrit.wikimedia.org/r/374601 (owner: BBlack)
[18:25:27] <wikibugs_>	 (CR) BBlack: [C: 2] browsersec: add fa translation [puppet] - https://gerrit.wikimedia.org/r/374602 (https://phabricator.wikimedia.org/T163251) (owner: BBlack)
[18:27:06] <wikibugs_>	 Operations, DC-Ops, netops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563233 (RobH)
[18:29:15] <wikibugs_>	 (PS1) BBlack: browsersec: add missing dir=rtl for fa [puppet] - https://gerrit.wikimedia.org/r/374604 (https://phabricator.wikimedia.org/T163251)
[18:29:55] <wikibugs_>	 (CR) BBlack: [C: 2] browsersec: add missing dir=rtl for fa [puppet] - https://gerrit.wikimedia.org/r/374604 (https://phabricator.wikimedia.org/T163251) (owner: BBlack)
[18:30:36] <wikibugs_>	 Operations, DC-Ops, netops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563264 (RobH) This was triggered by an error that both @ayounsi and I experienced attempting to connect to https://scs-oe11-esams.mgmt.esams.wmnet/  On FF and Chrome, it gives the error:   ``` Sec...
[18:32:13] <wikibugs_>	 Operations, DC-Ops, netops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563267 (RobH)
[18:36:21] <icinga-wm>	 RECOVERY - MegaRAID on db1048 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[18:36:39] <andrewbogott>	 godog: if I understand https://phabricator.wikimedia.org/T169860 correctly, new tests should be defined within the prometheus module itself?  (This contrasted with icinga where the tests are defined for the host-to-be-tested rather than on the host-that-runs-the-tests)
[18:48:07] <wikibugs_>	 Operations, DC-Ops, netops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563323 (RobH)
[18:48:27] <wikibugs_>	 Operations, Ops-Access-Requests, Gerrit, Patch-For-Review: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3563325 (Sharvaniharan) Resolved>Open @herron I accidentally overwrote my ssh public key. I am so sorry. Attached is my new...
[18:53:58] <wikibugs_>	 Operations, DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563341 (RobH)
[18:58:37] <wikibugs_>	 (PS1) BBlack: browsersec: re-order languages slightly [puppet] - https://gerrit.wikimedia.org/r/374605 (https://phabricator.wikimedia.org/T163251)
[19:00:05] <jouncebot>	 RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1900).
[19:00:34] <wikibugs_>	 (PS1) ArielGlenn: copy of completed dump files plus metadata from dumpsdata to web server [puppet] - https://gerrit.wikimedia.org/r/374606 (https://phabricator.wikimedia.org/T169849)
[19:00:36] <wikibugs_>	 (CR) BBlack: [C: 2] browsersec: re-order languages slightly [puppet] - https://gerrit.wikimedia.org/r/374605 (https://phabricator.wikimedia.org/T163251) (owner: BBlack)
[19:00:50] <wikibugs_>	 (CR) jerkins-bot: [V: -1] copy of completed dump files plus metadata from dumpsdata to web server [puppet] - https://gerrit.wikimedia.org/r/374606 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[19:01:22] <logmsgbot>	 !log ppchelko@tin Started deploy [restbase/deploy@7f2e55f]: Update CXServer endpoints config
[19:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:36] <ottomata>	 !log restarting main-eqiad -> analytics kafka mirror maker processes on analytics kafka brokers, something is not working...
[19:02:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:09] <wikibugs_>	 Operations, Dumps-Generation, Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3563386 (ArielGlenn) I'm hoping to avoid the bwlimit option, I use this in our current setup but it's a hard cap even when there's not use of the interface...
[19:08:11] <logmsgbot>	 !log ppchelko@tin Finished deploy [restbase/deploy@7f2e55f]: Update CXServer endpoints config (duration: 06m 48s)
[19:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:37] <wikibugs_>	 (CR) Chad: [C: 2] group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/374587 (owner: Chad)
[19:09:51] <wikibugs_>	 (PS1) RobH: new ssh pub key for Sharvani Haran [puppet] - https://gerrit.wikimedia.org/r/374607 (https://phabricator.wikimedia.org/T173886)
[19:10:10] <wikibugs_>	 (Merged) jenkins-bot: group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/374587 (owner: Chad)
[19:10:21] <wikibugs_>	 (CR) jenkins-bot: group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/374587 (owner: Chad)
[19:10:26] <robh>	 ottomata: you are clinic this week right?
[19:10:35] <robh>	 you forgot to update topic in here, i shall now =]
[19:12:59] <logmsgbot>	 !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.16
[19:13:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:17] <wikibugs_>	 (CR) Sharvaniharan: [C: 1] new ssh pub key for Sharvani Haran [puppet] - https://gerrit.wikimedia.org/r/374607 (https://phabricator.wikimedia.org/T173886) (owner: RobH)
[19:14:32] <wikibugs_>	 (CR) RobH: [C: 2] new ssh pub key for Sharvani Haran [puppet] - https://gerrit.wikimedia.org/r/374607 (https://phabricator.wikimedia.org/T173886) (owner: RobH)
[19:16:50] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:17:56] <ottomata>	 haha, i did, thanks robh
[19:18:00] <wikibugs_>	 Operations, Ops-Access-Requests, Gerrit, Patch-For-Review: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3563500 (RobH) Open>Resolved fixed and new pubkey is live.   change was confirmed as valid via user update into phab, as well...
[19:18:01] <wikibugs_>	 (PS1) Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - https://gerrit.wikimedia.org/r/374610
[19:18:05] <robh>	 no worries
[19:19:18] <wikibugs_>	 (PS2) Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - https://gerrit.wikimedia.org/r/374610
[19:19:49] <wikibugs_>	 Operations, Ops-Access-Requests, Gerrit, Patch-For-Review: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3563513 (Sharvaniharan) Thank you so much @RobH . You are a lifesaver!
[19:21:00] <wikibugs_>	 (CR) Ppchelko: [C: 1] Synchronize message.max.bytes between all kafka clusters and producers [puppet] - https://gerrit.wikimedia.org/r/374610 (owner: Ottomata)
[19:21:59] <ottomata>	 Pchelolo:  i think i also need to set max.request.size in the mirror maker producer configs
[19:22:01] <wikibugs_>	 Operations, hardware-requests, Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3563524 (greg)
[19:22:04] <wikibugs_>	 Operations, ops-eqiad, Release-Engineering-Team (Watching / External): tin has a failing hdd - https://phabricator.wikimedia.org/T174449#3563527 (greg)
[19:26:39] <wikibugs_>	 (PS3) Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - https://gerrit.wikimedia.org/r/374610
[19:28:26] <wikibugs_>	 (PS4) Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - https://gerrit.wikimedia.org/r/374610
[19:29:20] <wikibugs_>	 (PS5) Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - https://gerrit.wikimedia.org/r/374610
[19:29:31] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
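The graphite1001 checks above report what share of recent datapoints exceeded a threshold (1000.0 for CRITICAL; the later recoveries quote 250.0). The figures 11.11% and 22.22% are exactly 1/9 and 2/9, which is consistent with, though not proof of, a nine-point evaluation window. A minimal Python sketch of that percentage calculation, assuming a nine-sample window and a strict "greater than" comparison; the function name and sample values are hypothetical:

```python
from typing import List, Optional


def percent_above(datapoints: List[Optional[float]], threshold: float) -> float:
    """Return the percentage of non-null datapoints strictly above threshold."""
    values = [v for v in datapoints if v is not None]  # graphite may return nulls
    if not values:
        return 0.0
    above = sum(1 for v in values if v > threshold)
    return 100.0 * above / len(values)


# Hypothetical 9-point window of 5xx reqs/min (illustrative values only).
window = [310.0, 420.0, 1250.0, 380.0, 295.0, 410.0, 330.0, 405.0, 360.0]
print(f"{percent_above(window, 1000.0):.2f}%")  # 11.11%
```

With one sample out of nine over the threshold the check reports 11.11%; two out of nine gives 22.22%.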
[19:31:27] <wikibugs_>	 (CR) Ottomata: "Ok, looks good: https://puppet-compiler.wmflabs.org/compiler02/7650/" [puppet] - https://gerrit.wikimedia.org/r/374610 (owner: Ottomata)
[19:31:29] <wikibugs_>	 (CR) Ottomata: [C: 2] Synchronize message.max.bytes between all kafka clusters and producers [puppet] - https://gerrit.wikimedia.org/r/374610 (owner: Ottomata)
[19:34:16] <ottomata>	 !log restarting all kafka brokers and mirror maker processes to apply https://gerrit.wikimedia.org/r/#/c/374610/
[19:34:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:38] <wikibugs_>	 (PS1) Cmjohnson: Removing mgmt dns entries for mw1170-1179 T167130 [dns] - https://gerrit.wikimedia.org/r/374614
[19:36:42] <awight>	 ottomata: Do you have a minute for https://gerrit.wikimedia.org/r/#/c/374593/ ?
[19:36:45] <wikibugs_>	 (PS2) Cmjohnson: Removing mgmt dns entries for mw1170-1179 T167130 [dns] - https://gerrit.wikimedia.org/r/374614
[19:36:57] <ottomata>	 awight:  in the middle of restarting stuff to fix broken eventstreams
[19:37:00] <ottomata>	 will look in a bit
[19:37:06] <awight>	 thanks!
[19:37:38] <wikibugs_>	 (CR) Cmjohnson: [C: 2] Removing mgmt dns entries for mw1170-1179 T167130 [dns] - https://gerrit.wikimedia.org/r/374614 (owner: Cmjohnson)
[19:38:11] <icinga-wm>	 PROBLEM - Check systemd state on mw1259 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:41:53] <moritzm>	 ^ mw1259 fixed
[19:42:04] <moritzm>	 remnant of reimage process
[19:42:11] <icinga-wm>	 RECOVERY - Check systemd state on mw1259 is OK: OK - running: The system is fully operational
[19:42:20] <icinga-wm>	 PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[jobchron]
[19:44:21] <icinga-wm>	 RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[19:47:35] <urandom>	 !log T169939: Decommission Cassandra: restbase1008-b.eqiad.wmnet
[19:47:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:46] <stashbot>	 T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[19:50:51] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[19:51:00] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused
[19:51:10] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:51:20] <icinga-wm>	 PROBLEM - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:51:27] <urandom>	 looking ^^^
[19:53:05] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans Decommissioned host
[19:53:05] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused eevans Decommissioned host
[19:53:05] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned host
[19:53:05] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a service on restbase2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans Decommissioned host
[19:55:49] <wikibugs_>	 (PS3) Ppchelko: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - https://gerrit.wikimedia.org/r/370004 (owner: Mobrovac)
[19:56:31] <wikibugs_>	 (PS1) Ottomata: Set fetch.message.max.bytes for mirror maker consumers [puppet] - https://gerrit.wikimedia.org/r/374619
[19:57:18] <wikibugs_>	 (CR) jerkins-bot: [V: -1] JobQueue: Add the RunSingleJob.php script [mediawiki-config] - https://gerrit.wikimedia.org/r/370004 (owner: Mobrovac)
[19:58:39] <wikibugs_>	 (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/7651/" [puppet] - https://gerrit.wikimedia.org/r/374619 (owner: Ottomata)
[19:58:42] <wikibugs_>	 (CR) Ottomata: [C: 2] Set fetch.message.max.bytes for mirror maker consumers [puppet] - https://gerrit.wikimedia.org/r/374619 (owner: Ottomata)
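The Kafka changes above (message.max.bytes synchronized across brokers and producers, then fetch.message.max.bytes for the mirror maker consumers, plus the max.request.size that ottomata mentions for the mirror maker producers) are about keeping size limits consistent along the pipeline. A message the source cluster accepts must also fit through the mirror maker consumer, its producer, and the destination brokers. A minimal sketch of that consistency check; the flattened key names and byte values are hypothetical and do not reflect the actual puppet configuration:

```python
from typing import Dict, List


def size_limit_problems(cfg: Dict[str, int]) -> List[str]:
    """Flag settings that would block a message of the source broker's
    maximum size from flowing through mirror maker to the destination."""
    largest = cfg["source.message.max.bytes"]
    checks = {
        "consumer.fetch.message.max.bytes": "mirror maker consumer cannot fetch it",
        "producer.max.request.size": "mirror maker producer cannot send it",
        "dest.message.max.bytes": "destination brokers would reject it",
    }
    problems = []
    for key, consequence in checks.items():
        if cfg[key] < largest:
            problems.append(f"{key}={cfg[key]} < {largest}: {consequence}")
    return problems


# Hypothetical values: consumer and producer limits left smaller than the brokers'.
config = {
    "source.message.max.bytes": 4194304,
    "dest.message.max.bytes": 4194304,
    "consumer.fetch.message.max.bytes": 1048576,
    "producer.max.request.size": 1048576,
}
for problem in size_limit_problems(config):
    print(problem)
```

Raising the consumer and producer limits to at least the broker value makes the list empty.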
[19:58:48] <wikibugs_>	 (PS4) Ppchelko: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - https://gerrit.wikimedia.org/r/370004 (owner: Mobrovac)
[20:12:25] <wikibugs_>	 Operations, ops-eqiad, DC-Ops, netops: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3563777 (ayounsi)
[20:13:19] <wikibugs_>	 Operations, ops-eqiad, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3563783 (Cmjohnson)  mw1307-1328 are racked, idrac setup, mgmt dns and switch ports configured.    [x] receive in system on procurement task T159963 [x] bios/drac/serial setup/test...
[20:14:08] <wikibugs_>	 Operations, ops-eqiad, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3563786 (Cmjohnson)
[20:22:31] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:23:10] <icinga-wm>	 PROBLEM - trendingedits endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:23:21] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy
[20:23:41] <wikibugs_>	 (CR) Eevans: "> I am a bit concerned about doubling the write bandwidth consumed by" [puppet] - https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: Filippo Giunchedi)
[20:24:01] <icinga-wm>	 RECOVERY - trendingedits endpoints health on scb1001 is OK: All endpoints are healthy
[20:28:59] <wikibugs_>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3563854 (RobH) Ok, so putting the recipe info to ignore noswap requires:  partman-basicfilesystems partman-basicfilesystems/no_...
[20:29:02] <wikibugs_>	 (PS2) Rush: openstack: remove legacy firewall rules for controller [puppet] - https://gerrit.wikimedia.org/r/374424 (https://phabricator.wikimedia.org/T171494)
[20:29:05] <wikibugs_>	 (PS2) Rush: openstack: remove redis replication rule [puppet] - https://gerrit.wikimedia.org/r/374427 (https://phabricator.wikimedia.org/T171494)
[20:29:58] <wikibugs_>	 Operations, Mail: mail.wikimedia.org SSL cert expiring Mon 23 Oct 2017 - https://phabricator.wikimedia.org/T174081#3563870 (herron) Looking into how to renew this using Let's Encrypt.  The GlobalSign cert used today is configured with attributes:     CN=mail.wikimedia.org   SAN=cert mail.wikimedia.org, m...
[20:30:40] <wikibugs_>	 (CR) Rush: [C: 2] openstack: remove legacy firewall rules for controller [puppet] - https://gerrit.wikimedia.org/r/374424 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[20:31:57] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@ed0fadc]: Release a redis-based deduplicator in test mode
[20:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:40] <icinga-wm>	 PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:33:40] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational
[20:33:48] <wikibugs_>	 (CR) Rush: [C: 2] openstack: remove redis replication rule [puppet] - https://gerrit.wikimedia.org/r/374427 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[20:36:26] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@ed0fadc]: Release a redis-based deduplicator in test mode (duration: 04m 28s)
[20:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:21] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@a57c79d]: Release a redis-based deduplicator in test mode. Attempt 2
[20:37:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:38] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@a57c79d]: Release a redis-based deduplicator in test mode. Attempt 2 (duration: 01m 17s)
[20:38:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:45] <godog>	 !log bounce varnish on cp1074 / cp1099 / cp1072 - mailbox lag
[20:49:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:01] <wikibugs_>	 (PS2) Ayounsi: Icinga: Add basic monitoring for routers' active RE [puppet] - https://gerrit.wikimedia.org/r/374435 (https://phabricator.wikimedia.org/T174397)
[20:52:30] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0
[20:54:50] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[20:54:52] <wikibugs_>	 (CR) Ayounsi: [C: 2] Icinga: Add basic monitoring for routers' active RE [puppet] - https://gerrit.wikimedia.org/r/374435 (https://phabricator.wikimedia.org/T174397) (owner: Ayounsi)
[20:54:58] <wikibugs_>	 (PS3) Ayounsi: Icinga: Add basic monitoring for routers' active RE [puppet] - https://gerrit.wikimedia.org/r/374435 (https://phabricator.wikimedia.org/T174397)
[20:56:10] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0
[20:57:49] <wikibugs_>	 (CR) Eevans: "> > I am a bit concerned about doubling the write bandwidth consumed" [puppet] - https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: Filippo Giunchedi)
[21:05:09] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:05:29] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:05:29] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[21:06:32] <chasemp>	 andrewbogott: fullstack kicked a few instances in a row^
[21:06:41] <andrewbogott>	 dang
[21:06:46] <chasemp>	 seems weird puppet run related, it's doing that same thing w/ not being able to parse the puppet run output
[21:07:11] <chasemp>	   File "/usr/local/sbin/nova-fullstack", line 562, in <module>
[21:07:11] <chasemp>	     main()
[21:07:11] <chasemp>	   File "/usr/local/sbin/nova-fullstack", line 536, in main
[21:07:13] <chasemp>	     for k, v in puppetrun[d].iteritems():
[21:07:15] <chasemp>	 KeyError: 'changes'
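The traceback shows nova-fullstack iterating over `puppetrun[d]` and failing with `KeyError: 'changes'`, so the parsed Puppet run summary it received had no 'changes' section (chasemp later ties the broken runs to DNS failures on the new instances). A minimal Python 3 sketch of a more defensive version of that loop, which records missing sections instead of crashing; the function name and the list of section names are assumptions, not the script's actual code (which uses Python 2's `iteritems()`):

```python
from typing import Dict, List, Tuple


def flatten_run_summary(
    puppetrun: Dict[str, dict],
    sections: Tuple[str, ...] = ("resources", "changes", "events"),
) -> Tuple[Dict[str, object], List[str]]:
    """Flatten selected sections of a parsed Puppet run summary into
    'section.key' entries, collecting absent sections instead of raising KeyError."""
    flat: Dict[str, object] = {}
    missing: List[str] = []
    for d in sections:
        section = puppetrun.get(d)
        if not isinstance(section, dict):
            missing.append(d)  # report it rather than crash the whole test loop
            continue
        for k, v in section.items():
            flat[f"{d}.{k}"] = v
    return flat, missing


# Hypothetical summary from a run that never got as far as applying changes.
summary = {"resources": {"total": 0, "failed": 0}, "events": {"failure": 1}}
flat, missing = flatten_run_summary(summary)
print(missing)  # ['changes']
```

A caller could then treat a non-empty `missing` list as a failed run with a clear message, instead of dying on a bare KeyError.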
[21:07:45] <chasemp>	 andrewbogott: I did merge those 2 rule cleanups, although I can't for the life of me figure out how it would be related, fyi
[21:08:58] <andrewbogott>	 chasemp: where did you get that paste?  Is it in the boot log on horizon?
[21:09:12] <chasemp>	 andrewbogott: from /var/log/upstart/nova-fullstack.log
[21:09:17] <chasemp>	 on labnet1001
[21:10:38] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[21:10:47] <andrewbogott>	 seems like the VMs are actually broken, so it's not entirely the test's fault...
[21:11:01] <chasemp>	 that recovery is probably me running a manual test 
[21:13:02] <andrewbogott>	 I think it's dns that's failing
[21:13:05] <chasemp>	 andrewbogott: my manual test says
[21:13:06] <chasemp>	 2017-08-29 21:12:35,328 DEBUG sudo: unable to resolve host manual-fullstack-1504041011
[21:13:06] <andrewbogott>	 and the instance is actually fine...
[21:14:32] <chasemp>	 andrewbogott: yeah, so far DNS works for me generally for existing things but maybe isn't working for new instances?
[21:15:29] <andrewbogott>	 lots of timeouts in the designate-sink log
[21:15:38] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[21:16:19] <andrewbogott>	 !log restarting designate-sink on labservices1001
[21:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:45] <chasemp>	 andrewbogott: are labservices hosts ssh'ing to labcontrol?
[21:18:04] <chasemp>	 that could have been mysteriously allowed via https://gerrit.wikimedia.org/r/#/c/374424/2 previously and is now timing out
[21:18:09] <andrewbogott>	 oh
[21:18:16] <andrewbogott>	 yes, if there's not an explicit rule for that there should be
[21:18:18] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 409.45 seconds
[21:18:23] <andrewbogott>	 in order to clean up salt certificates
[21:18:24] <chasemp>	 let me look
[21:19:07] <wikibugs_>	 (PS1) Rush: Revert "openstack: remove legacy firewall rules for controller" [puppet] - https://gerrit.wikimedia.org/r/374641
[21:19:13] <wikibugs_>	 (PS2) Rush: Revert "openstack: remove legacy firewall rules for controller" [puppet] - https://gerrit.wikimedia.org/r/374641
[21:19:49] <wikibugs_>	 (CR) Rush: [C: 2] Revert "openstack: remove legacy firewall rules for controller" [puppet] - https://gerrit.wikimedia.org/r/374641 (owner: Rush)
[21:20:09] <chasemp>	 andrewbogott: I'm going to revert that cleanup to work on explicit rules for the moment
[21:20:17] <andrewbogott>	 ok, thanks
[21:20:47] <andrewbogott>	 My guess is that this is partly due to the fact that I separated the salt master from the puppetmasters. So there was an explicit rule, but it moved to the new puppetmasters… leaving the salt master untended.
[21:21:16] <chasemp>	 ah, yeah it's also what makes overly broad and permissive rules difficult to reason about
[21:21:57] <andrewbogott>	 so we want something like
[21:21:59] <andrewbogott>	 https://www.irccloud.com/pastebin/12xtnM5b/
[21:22:03] <chasemp>	 andrewbogott: I'm merging and applying, could you clear out the nova-fullstack backlog?
[21:22:05] <andrewbogott>	 on labcontrol
[21:22:08] <andrewbogott>	 yep
[21:23:18] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.29 seconds
[21:24:39] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[21:25:45] <chasemp>	 watching a test run through now
[21:25:55] <andrewbogott>	 dns looks better
[21:26:06] <andrewbogott>	 I restarted fullstack probably right after you did so we might leak one of those two that are running now
[21:26:40] <chasemp>	 heh kk
[21:26:45] <chasemp>	 should have coordinated
[21:27:52] <andrewbogott>	 ok if I delete fullstackd-1504041863 now so it's actually clean?
[21:28:03] <chasemp>	 andrewbogott: so seems to be working, and somehow DNS failures bubble up to be a funky puppet error
[21:28:05] <chasemp>	 andrewbogott: yup
[21:28:20] <chasemp>	 but it's kind of cool that this issue surfaced in short order in an expected place 
[21:28:33] <andrewbogott>	 yeah, better than finding out about it tomorrow from a user
[21:28:46] <andrewbogott>	 Do you want to write the proper firewall patch or shall I?
[21:29:09] <chasemp>	 andrewbogott: go for it, can you think of anything else that ssh's to labcontrol automation wise?
[21:29:33] <andrewbogott>	 ...
[21:29:36] <tabbycat>	 https://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%92%D0%BA%D0%BB%D0%B0%D0%B4/128.0.0.0/2 <-- this block is not right
[21:29:42] <tabbycat>	 a /2 block?!
[21:29:50] <tabbycat>	 what is that? a zillion of IPs?
[21:32:15] <chasemp>	 tabbycat something like 1073741824
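chasemp's figure follows from the prefix length: an IPv4 /n range contains 2^(32 - n) addresses, so a /2 covers 2^30 = 1,073,741,824 addresses, a quarter of the whole IPv4 space. A quick check with Python's standard `ipaddress` module:

```python
import ipaddress

net = ipaddress.ip_network("128.0.0.0/2")
print(net.num_addresses)                               # 1073741824
print(net.num_addresses == 2 ** (32 - net.prefixlen))  # True
print(net.num_addresses / 2 ** 32)                     # 0.25 of all IPv4 addresses
```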
[21:33:08] <urandom>	 !log T169939: Decommission Cassandra: restbase1008-c.eqiad.wmnet
[21:33:15] <jynus>	 tabbycat, relevant: https://labs.ripe.net/Members/emileaben/the-curious-case-of-128.0-16
[21:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:22] <stashbot>	 T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[21:33:24] <tabbycat>	 a billion of IPs, nah, doesn't look too bad :P
[21:35:56] <wikibugs_>	 (PS1) Andrew Bogott: openstack: refine firewall rules for controller [puppet] - https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494)
[21:36:17] <wikibugs_>	 (CR) jerkins-bot: [V: -1] openstack: refine firewall rules for controller [puppet] - https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494) (owner: Andrew Bogott)
[21:36:24] <andrewbogott>	 chasemp: I think ^ is right but let's wait until tomorrow to merge
[21:37:22] <chasemp>	 andrewbogott: kk but modules/role/manifests/labs/openstack/nova/controller.pp:45 ERROR single quoted string containing a variable found (single_quote_string_with_variables)
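The puppet-lint error chasemp quotes (single_quote_string_with_variables) exists because Puppet only interpolates variables in double-quoted strings; inside single quotes, `${...}` is kept literally. A rough Python approximation of that check, assuming one string per line and ignoring comments and heredocs; it is not puppet-lint's actual implementation, and the sample manifest lines are hypothetical:

```python
import re
from typing import List

# A single-quoted string literal, allowing \' and \\ escapes inside it.
SINGLE_QUOTED = re.compile(r"'((?:[^'\\]|\\.)*)'")
# '$' followed by '{' or the start of a variable name looks like an intended variable.
VARIABLE = re.compile(r"\$(?:\{|[a-z_:])")


def single_quoted_variables(line: str) -> List[str]:
    """Return the single-quoted literals on a line that appear to contain variables."""
    return [m.group(0) for m in SINGLE_QUOTED.finditer(line) if VARIABLE.search(m.group(1))]


print(single_quoted_variables("$msg = 'controller is ${::fqdn}'"))  # ["'controller is ${::fqdn}'"]
print(single_quoted_variables('$msg = "controller is ${::fqdn}"'))  # []
```

The usual fix, as in the follow-up patch set, is to switch such a string to double quotes so the variable is actually interpolated.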
[21:37:35] <andrewbogott>	 yeah, fixed
[21:37:36] <wikibugs_>	 (PS2) Andrew Bogott: openstack: refine firewall rules for controller [puppet] - https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494)
[21:37:39] <chasemp>	 sweet
[21:38:07] <wikibugs_>	 (CR) Rush: [C: 1] "nice, tomorrow we march" [puppet] - https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494) (owner: Andrew Bogott)
[21:45:19] <andrewbogott>	 chasemp: sorry about all the landmines in that openstack code!  It suffers from 5 years of incrementalism
[21:45:55] <chasemp>	 andrewbogott: no worries, I appreciate you helping me walk back through it
[21:46:58] <eddiegp>	 tabbycat: Seems to be some kind of display issue. 128./2 would be 128.0.0.0-191.255.255.255, but I'm fine editing with a 134.2. address (logged out). So I guess it's not really the case that they're blocking one fourth of all IP addresses ;) MediaWiki doesn't allow blocking such big ranges anyway, afaik.
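eddiegp's reasoning can be checked the same way: 128.0.0.0/2 spans 128.0.0.0 through 191.255.255.255, so any 134.2.x.x address falls inside it, and being able to edit from one means the /2 shown is not what is actually enforced. A sketch using Python's `ipaddress` module; 134.2.0.1 is a hypothetical stand-in for the partially given "134.2." address:

```python
import ipaddress

net = ipaddress.ip_network("128.0.0.0/2")
print(net.network_address, net.broadcast_address)  # 128.0.0.0 191.255.255.255
# Hypothetical full address standing in for the "134.2." address mentioned above.
print(ipaddress.ip_address("134.2.0.1") in net)    # True
```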
[21:47:44] <tabbycat>	 most likely... I know MW does not allow rangeblocks larger than /16 on IPv4
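The /16 limit tabbycat mentions corresponds to MediaWiki's `$wgBlockCIDRLimit` setting, whose default IPv4 value is 16, so a block may be no wider than a /16. A minimal Python sketch of that rule, not MediaWiki's actual PHP implementation; the function name is hypothetical:

```python
import ipaddress

# Assumed to mirror the default IPv4 value of MediaWiki's $wgBlockCIDRLimit.
IPV4_CIDR_LIMIT = 16


def ipv4_range_block_allowed(cidr: str, limit: int = IPV4_CIDR_LIMIT) -> bool:
    """Return True if an IPv4 range is no wider than a /limit."""
    net = ipaddress.ip_network(cidr, strict=False)
    if net.version != 4:
        raise ValueError("this sketch only handles IPv4 ranges")
    # A smaller prefix length means a wider range, so it must be >= the limit.
    return net.prefixlen >= limit


print(ipv4_range_block_allowed("128.0.0.0/2"))   # False
print(ipv4_range_block_allowed("134.2.0.0/16"))  # True
```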
[21:48:11] <tabbycat>	 I was doing some CU stuff and saw that and was like... hello?!
[21:48:50] <eddiegp>	 Yeah, I thought the same when reading this, that's why I tried out with some IP from that range. :D
[21:54:03] <wikibugs_>	 (PS1) RobH: further tweaks to kafka-jumbo [puppet] - https://gerrit.wikimedia.org/r/374645 (https://phabricator.wikimedia.org/T174457)
[22:23:12] <wikibugs_>	 (PS1) Rush: prometheus: allow setting a specific listening address and port [puppet] - https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039)
[22:55:02] <wikibugs_>	 (PS1) MaxSem: Migrate AbuseFilter config off wmg variables, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/374651
[22:55:04] <wikibugs_>	 (PS1) MaxSem: Migrate AbuseFilter config off wmg variables, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/374652
[22:55:06] <wikibugs_>	 (PS1) MaxSem: Move a variable closer to other relevant code [mediawiki-config] - https://gerrit.wikimedia.org/r/374653
[22:56:36] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Migrate AbuseFilter config off wmg variables, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/374651 (owner: MaxSem)
[22:57:59] <wikibugs_>	 (PS1) EBernhardson: Configure CirrusSearch human relevance survey [mediawiki-config] - https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106)
[22:59:01] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Configure CirrusSearch human relevance survey [mediawiki-config] - https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: EBernhardson)
[23:00:04] <jouncebot>	 addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T2300).
[23:00:05] <jouncebot>	 bmansurov: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:59] <bmansurov>	 here
[23:01:29] <RainbowSprinkles>	 I can swat it
[23:02:05] <wikibugs_>	 (PS2) EBernhardson: Configure CirrusSearch human relevance survey [mediawiki-config] - https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106)
[23:03:45] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Configure CirrusSearch human relevance survey [mediawiki-config] - https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: EBernhardson)
[23:08:29] <wikibugs_>	 (PS3) GWicke: Enable JobQueueEventBus on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/374399 (owner: Ppchelko)
[23:17:51] <bmansurov>	 RainbowSprinkles, thanks for swatting, where can I see the change? Or is it not live yet?
[23:29:39] <wikibugs_>	 (PS2) MaxSem: Migrate AbuseFilter config off wmg variables, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/374651
[23:29:41] <wikibugs_>	 (PS2) MaxSem: Migrate AbuseFilter config off wmg variables, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/374652
[23:29:43] <wikibugs_>	 (PS2) MaxSem: Move a variable closer to other relevant code [mediawiki-config] - https://gerrit.wikimedia.org/r/374653
[23:29:47] <wikibugs_>	 Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3564554 (Cmjohnson) Spoke with HP support, not very helpful. They will not send anyone to help unless we want to pay for it.  Going to try and talk...
[23:32:01] <RainbowSprinkles>	 bmansurov: I'm so sorry, got distracted going down a rabbit hole. It's merged, one sec and I'll sync it live
[23:32:16] <bmansurov>	 thanks
[23:33:40] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.16/extensions/Popups/: swat (duration: 00m 50s)
[23:33:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:55] <RainbowSprinkles>	 Wait, wrong directory
[23:33:56] <RainbowSprinkles>	 lol
[23:34:42] <Zppix>	 Lol
[23:34:49] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.15/extensions/Popups/: swat (duration: 00m 47s)
[23:35:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:40] <RainbowSprinkles>	 Ok there we go
[23:36:11] <bmansurov>	 I see the change at 1002
[23:36:33] <RainbowSprinkles>	 It's live everywhere :)
[23:36:35] <bmansurov>	 RainbowSprinkles, thanks!
[23:47:08] <wikibugs_>	 Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3564575 (Eevans)
[23:49:58] <icinga-wm>	 PROBLEM - Host mw1228 is DOWN: PING CRITICAL - Packet loss = 100%
[23:50:15] <cmjohnson1>	 !log reinstalling mw1228
[23:50:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:18] <icinga-wm>	 RECOVERY - Host mw1228 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms