[00:25:17] !log T169939: Decommissioning restbase1010-a.eqiad.wmnet
[00:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:32] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[00:25:49] (PS1) Reedy: $wgScoreSafeMode = false [mediawiki-config] - https://gerrit.wikimedia.org/r/374445 (https://phabricator.wikimedia.org/T174413)
[00:26:47] (PS1) Ebe123: Set $wgScoreSafeMode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413)
[00:30:33] (CR) MZMcBride: "Dupe of ?" [mediawiki-config] - https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: Ebe123)
[00:37:37] (CR) Ebe123: "> Dupe of ?" [mediawiki-config] - https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: Ebe123)
[00:40:44] (CR) MZMcBride: "True. :-) And you filed the Phabricator task. I think Reedy should abandon his changeset." [mediawiki-config] - https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: Ebe123)
[00:49:52] (Abandoned) Reedy: $wgScoreSafeMode = false [mediawiki-config] - https://gerrit.wikimedia.org/r/374445 (https://phabricator.wikimedia.org/T174413) (owner: Reedy)
[01:04:23] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[01:05:24] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9813380 keys, up 5 minutes 17 seconds - replication_delay is 0
[01:36:44] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 40
[02:07:53] PROBLEM - cassandra-a SSL 10.64.0.114:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[02:10:33] Operations, ops-eqiad, Services (doing): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3560918 (Eevans) We needed to decommission a node in rack 'a' as part of {T169939}, that was going to be 1007 (for consistency sake), but restbase1010 has been decommissioned ins...
[02:13:54] Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3560922 (Eevans)
[02:15:03] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.48.46:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned (T169939)
[02:16:03] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.114:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned (T169939)
[02:27:26] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.15) (duration: 08m 00s)
[02:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Aug 29 02:34:09 UTC 2017 (duration 6m 44s)
[02:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:13] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:25:19] Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3560941 (Papaul) @madhuvishy I took a quick look at labstore2001 the H800 controller doesn't allow me to create a RAID...
[03:28:43] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms
[03:31:54] PROBLEM - Host labstore2002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:32:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.09 seconds
[03:53:53] RECOVERY - Host labstore2002 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms
[04:41:24] (CR) Phedenskog: "Anything more that needs to be done on this? I'm waiting for this to be pushed before I can push my changes :)" [puppet] - https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[04:44:27] Operations, Dumps-Generation, Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3561021 (ArielGlenn) A few more thoughts. I should stop thinking of this as an rsync and instead think of it as a copy of files that don't exist/need updat...
[04:55:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 264.49 seconds
[05:27:35] (CR) Reception123: [C: 1] Gerrit: Set auth.userNameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/368196 (owner: Paladox)
[05:33:55] Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3561111 (madhuvishy) @Papaul, Hardware RAID 10 on both labstore2001 and 2002, with 6 or 8 disks per logical/virtual RAID...
[05:49:00] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.11 (duration: 02m 50s)
[05:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:44] (PS1) Chad: Fixing indentation warning [mediawiki-config] - https://gerrit.wikimedia.org/r/374453
[05:58:46] (CR) Chad: [C: 2] Fixing indentation warning [mediawiki-config] - https://gerrit.wikimedia.org/r/374453 (owner: Chad)
[06:00:11] (PS2) Marostegui: Add electcomwiki to private_wikis [puppet] - https://gerrit.wikimedia.org/r/374384 (https://phabricator.wikimedia.org/T174370) (owner: Reedy)
[06:00:13] (Merged) jenkins-bot: Fixing indentation warning [mediawiki-config] - https://gerrit.wikimedia.org/r/374453 (owner: Chad)
[06:00:27] (CR) jenkins-bot: Fixing indentation warning [mediawiki-config] - https://gerrit.wikimedia.org/r/374453 (owner: Chad)
[06:01:12] (CR) Marostegui: [C: 2] Add electcomwiki to private_wikis [puppet] - https://gerrit.wikimedia.org/r/374384 (https://phabricator.wikimedia.org/T174370) (owner: Reedy)
[06:05:01] !log Restart MariaDB on db1102 and db1095 to pick up new replication filters - T174385
[06:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:16] T174385: Prepare and check storage layer for electcomwiki - https://phabricator.wikimedia.org/T174385
[06:09:34] !log demon@tin Synchronized scap/plugins/clean.py: no-op, for consistency (duration: 00m 43s)
[06:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:59] Operations, DBA, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3561141 (Marostegui)
[06:29:12] Operations, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3561142 (Marostegui) I have closed the ticket that relates to the DBAs (add the replication filters and restart MariaDB on the sanitarium hosts). Going to remove the...
[06:44:03] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 57491 MB (3% inode=97%)
[06:56:28] !log installing ghostscript security updates on trusty (Debian already fixed)
[06:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:35] (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/374476 (https://phabricator.wikimedia.org/T168661)
[07:06:20] Operations, Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3561169 (ema) Open>Resolved Looks good, thanks @Cmjohnson!
[07:13:18] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/374476 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:14:50] (Merged) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/374476 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:16:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 - T168661 (duration: 00m 42s)
[07:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:16] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[07:17:38] (CR) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/374476 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:19:58] (PS1) Marostegui: mariadb: Update db1091 socket location [puppet] - https://gerrit.wikimedia.org/r/374487 (https://phabricator.wikimedia.org/T148507)
[07:32:45] RECOVERY - Check systemd state on mw1259 is OK: OK - running: The system is fully operational
[07:33:20] systemctl reset-failed puppet.service --^
[07:41:18] !log Upgrade MariaDB on db1091 to 10.0.32 - T168661
[07:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:31] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[07:45:28] Operations, monitoring, Graphite, User-fgiunchedi: Audit groups of metrics in Graphite that allocate a lot of disk space - https://phabricator.wikimedia.org/T1075#3561227 (fgiunchedi)
[07:45:51] Operations, Analytics: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561228 (fgiunchedi)
[07:46:03] Operations, Analytics, monitoring: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3106537 (fgiunchedi)
[07:46:06] elukey: reopened ^
[07:47:07] godog: ack!
[07:49:24] (CR) Muehlenhoff: [C: -1] "ffmpeg2theora is current still needed in the TMH extension to differentiate between Ogg Vorbis audio files and Ogg Theora video files, nee" [puppet] - https://gerrit.wikimedia.org/r/373733 (https://phabricator.wikimedia.org/T172445) (owner: Muehlenhoff)
[07:49:24] godog: in the meantime I'll try to delete mtime +7
[07:49:59] elukey: ok thanks!
[07:51:14] RECOVERY - Disk space on graphite1001 is OK: DISK OK
[07:55:05] (CR) Marostegui: [C: 2] mariadb: Update db1091 socket location [puppet] - https://gerrit.wikimedia.org/r/374487 (https://phabricator.wikimedia.org/T148507) (owner: Marostegui)
[07:56:09] (PS3) Giuseppe Lavagetto: git::clone: enhance compatibility with the future parser [puppet] - https://gerrit.wikimedia.org/r/374321 (https://phabricator.wikimedia.org/T171704)
[07:57:36] (PS1) Elukey: role::graphite::production: lower down eventstreams rdkafka retention [puppet] - https://gerrit.wikimedia.org/r/374500 (https://phabricator.wikimedia.org/T160644)
[07:57:45] godog: --^
[07:58:05] (CR) Giuseppe Lavagetto: [C: 2] git::clone: enhance compatibility with the future parser [puppet] - https://gerrit.wikimedia.org/r/374321 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[07:58:13] (CR) Elukey: [C: 2] role::graphite::production: lower down eventstreams rdkafka retention [puppet] - https://gerrit.wikimedia.org/r/374500 (https://phabricator.wikimedia.org/T160644) (owner: Elukey)
[07:58:18] (PS2) Elukey: role::graphite::production: lower down eventstreams rdkafka retention [puppet] - https://gerrit.wikimedia.org/r/374500 (https://phabricator.wikimedia.org/T160644)
[07:58:21] (CR) Elukey: [V: 2 C: 2] role::graphite::production: lower down eventstreams rdkafka retention [puppet] - https://gerrit.wikimedia.org/r/374500 (https://phabricator.wikimedia.org/T160644) (owner: Elukey)
[07:58:29] <_joe_> elukey: merge my change when you're done
[07:58:53] _joe_ ack
[07:59:00] <_joe_> and btw now verification takes below 20 seconds, don't V+2 yourself
[07:59:54] _joe_ I got +2 literally two seconds before that
[08:01:04] (PS2) Giuseppe Lavagetto: role::mariadb::misc: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374349 (https://phabricator.wikimedia.org/T171704)
[08:01:52] Operations, Analytics, monitoring, Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561239 (elukey) The other step to take would be to limit the amount of data that we store for librkafka, because with so many clients it is impossible to keep track o...
[08:02:42] (PS1) Marostegui: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661)
[08:03:15] (CR) Alexandros Kosiaris: [C: 1] Icinga: Add basic monitoring for routers' active RE [puppet] - https://gerrit.wikimedia.org/r/374435 (https://phabricator.wikimedia.org/T174397) (owner: Ayounsi)
[08:03:31] (PS2) Marostegui: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661)
[08:04:17] Operations, ops-codfw, Patch-For-Review, User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3561243 (elukey)
[08:04:20] Operations, ops-codfw, Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3561241 (elukey) Open>Resolved Closing this task since the hw issue should have been resolved. Will re-open if necessary. Thanks @Papaul for the work done!
[08:06:37] !log drop log.MobileWebUIClickTracking_10742159_15423246 from dbstore1002 to free space (table archived on HDFS) - T172322 T168303
[08:06:43] marostegui: --^
[08:06:48] <3
[08:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:52] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:06:52] T172322: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322
[08:06:52] T168303: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303
[08:06:53] Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3561251 (akosiaris) stalled>Open a:fgiunchedi Since the upgrade is done, I am reverting actions taken in T171167#3536747 and T171167#3...
[08:06:58] is it a big one?
[08:08:16] marostegui: 500GB on paper, but probably a lot less on disk
[08:08:19] (Merged) jenkins-bot: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:08:28] (CR) jenkins-bot: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374501 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:09:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1091 with low weight - T168661 (duration: 00m 43s)
[08:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:31] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[08:13:17] (CR) Alexandros Kosiaris: [C: 1] "With notification_options => 'c,r,f' for monitoring::service, yes this will work as advertised." [puppet] - https://gerrit.wikimedia.org/r/374368 (https://phabricator.wikimedia.org/T172131) (owner: Herron)
[08:13:29] (PS2) ArielGlenn: add user and directory setup to dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/374242 (https://phabricator.wikimedia.org/T169849)
[08:15:20] Operations, monitoring, netops, User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3561258 (fgiunchedi) For power usage my first attempt is something like this to calculate watts for 3 phase PDU: `sum(current) * avg(voltage) * sqrt(3)` Or a...
[08:20:00] !log reimaging mw1169 (video scaler) to jessie
[08:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:37] Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3561262 (fgiunchedi) Open>Resolved Thanks @akosiaris @ayounsi ! No more invalid metrics in graphite logs AFAICS, resolving!
[08:32:03] !log apt-get upgrade on contint1001 and contint2001
[08:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:42] (PS1) Marostegui: db-eqiad.php: Increase db1091 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374503
[08:34:57] (CR) Filippo Giunchedi: "LGTM! Please add a new entry to debian/changelog with a new version number (1.2-3). The "dch" tool in "devscripts" package can help you do" [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/373595 (https://phabricator.wikimedia.org/T161719) (owner: Matthias Mullie)
[08:35:55] !log restart wdqs-updater on wdqs2001
[08:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:11] (CR) Marostegui: [C: 2] db-eqiad.php: Increase db1091 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374503 (owner: Marostegui)
[08:39:22] (CR) Filippo Giunchedi: "> So this is ~50G for the data raid-1? If so, that seems to be about" [puppet] - https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: Filippo Giunchedi)
[08:40:38] (Merged) jenkins-bot: db-eqiad.php: Increase db1091 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374503 (owner: Marostegui)
[08:40:48] (CR) jenkins-bot: db-eqiad.php: Increase db1091 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374503 (owner: Marostegui)
[08:41:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1091 weight - T168661 (duration: 00m 43s)
[08:41:49] !log restarting archiva to pick up openjdk security update
[08:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:02] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[08:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:13] !log upload kubernetes_1.7.4-1 to apt.wikimedia.org/stretch-wikimedia/main T170119
[08:42:19] !log restart yarn/hdfs daemons for openjdk security updates
[08:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:25] T170119: Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119
[08:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:07] !log Restarting Jenkins for openjdk update
[08:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:26] Operations, media-storage: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3561301 (Nick) Also needs File:Literature II tom, Harutyun Surkhatian.djvu deleted, created by same user with same deleti...
[08:44:56] (PS2) Filippo Giunchedi: install_server: add partman for cassandra JBOD [puppet] - https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939)
[08:45:40] !log reprepro copy calico, calico-cni from jessie-wikimedia to stretch-wikimedia (apt.wikimedia.org) T170119
[08:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:52] Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3561315 (Gehel) Logs are now sent to logstash, but the "host" field isn't set correctly (its value is always "%{HOSTNAME}". Some analysis: * logs...
[08:53:20] !log upgrading cache_text to varnish 4.1.8-1wm1
[08:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:55] (CR) Giuseppe Lavagetto: [C: 2] role::mariadb::misc: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374349 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[09:07:05] (PS2) Giuseppe Lavagetto: phabricator::logmail: fix scoping of templates [puppet] - https://gerrit.wikimedia.org/r/374350 (https://phabricator.wikimedia.org/T171704)
[09:07:08] (PS3) ArielGlenn: add user and directory setup to dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/374242 (https://phabricator.wikimedia.org/T169849)
[09:14:48] (PS1) Marostegui: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374509
[09:17:06] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1091 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374509 (owner: Marostegui)
[09:18:31] (Merged) jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374509 (owner: Marostegui)
[09:18:41] (CR) jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374509 (owner: Marostegui)
[09:19:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1091 original weight - T168661 (duration: 00m 43s)
[09:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:42] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[09:24:31] (PS1) Marostegui: db-eqiad.php: Depool db1064 for MariaDB upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/374510 (https://phabricator.wikimedia.org/T168661)
[09:27:11] (PS1) Alexandros Kosiaris: Reimage kubernetes1004, chlorine as stretch [puppet] - https://gerrit.wikimedia.org/r/374511 (https://phabricator.wikimedia.org/T170119)
[09:27:58] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[09:28:01] (PS3) Giuseppe Lavagetto: phabricator::logmail: fix scoping of templates [puppet] - https://gerrit.wikimedia.org/r/374350 (https://phabricator.wikimedia.org/T171704)
[09:28:59] PROBLEM - mediawiki-installation DSH group on mw1169 is CRITICAL: Host mw1169 is not in mediawiki-installation dsh group
[09:29:13] !log re-installed pmacct/librdkafka1/kafkacat on rhenium with stretch versions - T173489
[09:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:28] T173489: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489
[09:29:41] paravoid: --^
[09:29:50] so puppet is re-enabled
[09:33:38] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1169.eqiad.wmnet
[09:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:15] (Abandoned) Elukey: profile::pmacct: pin librdkafka to stretch version [puppet] - https://gerrit.wikimedia.org/r/374360 (https://phabricator.wikimedia.org/T173489) (owner: Elukey)
[09:36:11] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1064 for MariaDB upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/374510 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[09:37:38] (Merged) jenkins-bot: db-eqiad.php: Depool db1064 for MariaDB upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/374510 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[09:37:47] (CR) jenkins-bot: db-eqiad.php: Depool db1064 for MariaDB upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/374510 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[09:38:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1064 for a mariadb upgrade - T168661 (duration: 00m 43s)
[09:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:57] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[09:39:41] !log Update MariaDB on db1064 to 10.0.32 - T168661
[09:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:15] (PS1) Gehel: wdqs - change logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710)
[09:40:38] (CR) jerkins-bot: [V: -1] wdqs - change logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710) (owner: Gehel)
[09:41:43] (PS2) Gehel: wdqs - change logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710)
[09:42:10] (CR) jerkins-bot: [V: -1] wdqs - change logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710) (owner: Gehel)
[09:43:00] (PS3) Gehel: wdqs - logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710)
[09:45:03] (CR) Giuseppe Lavagetto: [C: 2] phabricator::logmail: fix scoping of templates [puppet] - https://gerrit.wikimedia.org/r/374350 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[09:45:31] (PS1) Marostegui: mariadb: Update socket location for db1064 [puppet] - https://gerrit.wikimedia.org/r/374514 (https://phabricator.wikimedia.org/T148507)
[09:45:51] (PS2) Marostegui: mariadb: Update socket location for db1064 [puppet] - https://gerrit.wikimedia.org/r/374514 (https://phabricator.wikimedia.org/T148507)
[09:46:39] (CR) Marostegui: [C: 2] mariadb: Update socket location for db1064 [puppet] - https://gerrit.wikimedia.org/r/374514 (https://phabricator.wikimedia.org/T148507) (owner: Marostegui)
[09:47:15] _joe_: is it ok to merge your changes?
[09:47:29] <_joe_> marostegui: yeah sorry
[09:47:36] np, will do it now
[09:47:37] <_joe_> I was verifying the next ones
[09:50:36] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1064 for MariaDB upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/374515
[09:51:56] (PS2) Giuseppe Lavagetto: role::mariadb::misc::phabricator: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374351 (https://phabricator.wikimedia.org/T171704)
[09:54:25] (CR) Giuseppe Lavagetto: [C: 2] role::mariadb::misc::phabricator: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374351 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[09:57:01] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1064 for MariaDB upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/374515 (owner: Marostegui)
[09:57:25] (PS2) Giuseppe Lavagetto: requesttracker::config: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374352 (https://phabricator.wikimedia.org/T171704)
[09:58:22] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1064 for MariaDB upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/374515 (owner: Marostegui)
[09:58:32] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1064 for MariaDB upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/374515 (owner: Marostegui)
[09:59:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1064 - T168661 (duration: 00m 43s)
[09:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:39] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[10:00:37] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:37] <_joe_> uhm
[10:02:43] <_joe_> that host has troubles
[10:03:23] (PS3) Jcrespo: mariadb: Decommission db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/373869 (https://phabricator.wikimedia.org/T174076)
[10:03:28] nooooooo
[10:03:36] * elukey cries
[10:03:43] we just replaced the mainboard...
[10:03:50] and again it breaks..
[10:05:26] so it was not that :D
[10:05:50] Operations, ops-codfw, Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3561443 (elukey) Resolved>Open
[10:05:52] Operations, ops-codfw, Patch-For-Review, User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3561444 (elukey)
[10:06:41] (CR) Jcrespo: [C: 2] mariadb: Decommission db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/373869 (https://phabricator.wikimedia.org/T174076) (owner: Jcrespo)
[10:07:04] (CR) Giuseppe Lavagetto: [C: 2] requesttracker::config: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374352 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[10:08:11] (Merged) jenkins-bot: mariadb: Decommission db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/373869 (https://phabricator.wikimedia.org/T174076) (owner: Jcrespo)
[10:08:20] (CR) jenkins-bot: mariadb: Decommission db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/373869 (https://phabricator.wikimedia.org/T174076) (owner: Jcrespo)
[10:08:27] Operations, ops-codfw, Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3561451 (elukey) Host frozen again, not responding to ssh and pings, com2 shows `[82623.895993] g`
[10:10:07] (PS2) Giuseppe Lavagetto: ganglia::gmetad::rrdcached: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374353 (https://phabricator.wikimedia.org/T171704)
[10:14:29] we should depool mw2256 from deployment
[10:15:09] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms
[10:15:10] !log jynus@tin Synchronized wmf-config/db-codfw.php: decom db1028 (duration: 02m 48s)
[10:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:27] jynus: yep it was until yesterday, I thought we solved the issue, going to depool again
[10:16:25] !log jynus@tin Synchronized wmf-config/db-eqiad.php: decom db1028 (duration: 00m 42s)
[10:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:20] moritzm: something strange is that mw2256 is running with 4.9.30-2+deb9u2~bpo8+1
[10:18:33] just noticed it
[10:18:39] the others of its batch have 4.9.25-1~bpo8+3
[10:19:04] (CR) Filippo Giunchedi: "PCC says yes https://puppet-compiler.wmflabs.org/compiler02/7637/lvs1003.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930) (owner: Filippo Giunchedi)
[10:19:44] ah maybe it was due to the last reimage, doesn't count much
[10:20:30] (CR) Ema: [C: 1] hieradata: bump ProxyFetch timeout for thumbor [puppet] - https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930) (owner: Filippo Giunchedi)
[10:22:12] it was 4.9.25-1~bpo8+3 when we sent the sos report, just checked
[10:22:37] (PS2) Filippo Giunchedi: hieradata: bump ProxyFetch timeout for thumbor [puppet] - https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930)
[10:23:32] (CR) Filippo Giunchedi: [C: 2] hieradata: bump ProxyFetch timeout for thumbor [puppet] - https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930) (owner: Filippo Giunchedi)
[10:24:50] _joe_: merging your change too
[10:24:59] <_joe_> ouch sorry
[10:25:01] <_joe_> thanks
[10:26:46] np!
[10:28:16] (03PS4) 10ArielGlenn: add user and directory setup to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/374242 (https://phabricator.wikimedia.org/T169849) [10:29:00] RECOVERY - mediawiki-installation DSH group on mw1169 is OK: OK [10:29:10] elukey: that's because you reimaged it and it got installed with the latest kernel [10:29:37] all fine, the 4.9.30 update hasn't been rolled out fleet-wide [10:30:01] yep yep [10:30:35] I checked the sosreport and 4.9.25-1~bpo8+3 was installed, so my pebkac [10:30:39] :) [10:31:33] !log reimaging mw1168 (video scaler) to jessie [10:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:00] hi, is someone online whom I can ask a few questions about the object cache? [10:32:41] (03CR) 10ArielGlenn: [C: 032] add user and directory setup to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/374242 (https://phabricator.wikimedia.org/T169849) (owner: 10ArielGlenn) [10:36:28] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374521 (https://phabricator.wikimedia.org/T161088) [10:36:59] PROBLEM - puppet last run on dumpsdata1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/data/xmldatadumps/public/wikidatawiki/entities] [10:39:21] (03PS3) 10Matthias Mullie: Add missing THREED2PNG_PATH [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/373595 (https://phabricator.wikimedia.org/T161719) [10:43:35] (03PS2) 10Zfilipin: Add several HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [10:45:30] (03PS1) 10ArielGlenn: manually create wikidatawiki dumps dir on dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/374522 [10:45:44] ignore the dumpsdata whine, fixing [10:47:40] 10Operations, 10Patch-For-Review, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3561542 (10Urbanecm) p:05Triage>03Low [10:47:43] (03PS2) 10ArielGlenn: manually create wikidatawiki dumps dir on dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/374522 [10:48:00] !log installing libxml2 security updates [10:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:37] 10Operations, 10Patch-For-Review, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3559677 (10Urbanecm) Will create config. 
[10:49:34] (03CR) 10ArielGlenn: [C: 032] manually create wikidatawiki dumps dir on dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/374522 (owner: 10ArielGlenn) [10:51:12] RECOVERY - puppet last run on dumpsdata1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:52:46] (03PS4) 10Hashar: apt:pin pref file must not have space [puppet] - 10https://gerrit.wikimedia.org/r/353540 [10:52:48] (03PS1) 10Hashar: apt: spec boiler plate [puppet] - 10https://gerrit.wikimedia.org/r/374527 [10:53:20] (03CR) 10Hashar: "I have added the boiler plate for a follow up change https://gerrit.wikimedia.org/r/#/c/353540/" [puppet] - 10https://gerrit.wikimedia.org/r/374527 (owner: 10Hashar) [10:53:35] (03CR) 10Hashar: "Rebased and amended to add a basic spec test." [puppet] - 10https://gerrit.wikimedia.org/r/353540 (owner: 10Hashar) [10:54:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374521 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [10:55:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374521 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [10:55:53] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2065605 [10:56:17] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374521 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [10:57:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 - T161088 (duration: 00m 43s) [10:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:44] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088 [10:58:32] (03PS1) 10Urbanecm: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 
(https://phabricator.wikimedia.org/T174370) [10:59:57] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [11:02:02] (03PS1) 10Jcrespo: mariadb: Decomission db1033 and db1028 [puppet] - 10https://gerrit.wikimedia.org/r/374529 (https://phabricator.wikimedia.org/T174076) [11:03:23] (03PS1) 10Jcrespo: mariadb: Decommission db1033 and db1028 [software] - 10https://gerrit.wikimedia.org/r/374530 (https://phabricator.wikimedia.org/T174076) [11:04:17] (03PS2) 10Jcrespo: mariadb: Decommission db1033 and db1028 [puppet] - 10https://gerrit.wikimedia.org/r/374529 (https://phabricator.wikimedia.org/T174076) [11:05:36] (03PS2) 10Urbanecm: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) [11:08:13] !log Stop MariaDB on db1097 to migrate it to file per table - T161088 [11:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:26] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088 [11:12:55] (03CR) 10Jcrespo: [C: 032] mariadb: Decommission db1033 and db1028 [software] - 10https://gerrit.wikimedia.org/r/374530 (https://phabricator.wikimedia.org/T174076) (owner: 10Jcrespo) [11:13:07] (03CR) 10Jcrespo: [C: 032] mariadb: Decommission db1033 and db1028 [puppet] - 10https://gerrit.wikimedia.org/r/374529 (https://phabricator.wikimedia.org/T174076) (owner: 10Jcrespo) [11:15:11] PROBLEM - mediawiki-installation DSH group on mw1168 is CRITICAL: Host mw1168 is not in mediawiki-installation dsh group [11:15:21] PROBLEM - DPKG on kubernetes1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:15:21] PROBLEM - Check size of conntrack table on mw1168 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:16:10] PROBLEM - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group [11:16:20] PROBLEM - Disk space on kubernetes1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:20] PROBLEM - Check systemd state on mw1168 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:20] PROBLEM - nutcracker port on mw1168 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:17:01] PROBLEM - nutcracker process on mw1168 is CRITICAL: Return code of 255 is out of bounds [11:17:01] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1168 is CRITICAL: Return code of 255 is out of bounds [11:17:20] RECOVERY - Check size of conntrack table on mw1168 is OK: OK: nf_conntrack is 0 % full [11:17:51] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: Return code of 255 is out of bounds [11:19:00] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [11:19:07] kubernetes1004 is known [11:19:11] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [11:19:16] damn icinga was faster than the reimage [11:19:20] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK [11:19:22] ^ mw1168 is a reimage, silencing it [11:21:30] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 2 minutes ago with 5 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX],Logical_volume[data] [11:22:20] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:25:26] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational [11:25:47] (03PS1) 10Jcrespo: mariadb: Remove db1028 and db1033 from hiera [puppet] - 10https://gerrit.wikimedia.org/r/374531 (https://phabricator.wikimedia.org/T174076) [11:26:17] (03CR) 10Jcrespo: [C: 032] mariadb: Remove db1028 and db1033 from hiera [puppet] - 10https://gerrit.wikimedia.org/r/374531 (https://phabricator.wikimedia.org/T174076) (owner: 10Jcrespo) [11:26:36] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [11:27:11] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3561653 (10Johan) Thanks for reporting. We've looked into... [11:28:26] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:29:16] RECOVERY - nutcracker process on mw1168 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [11:29:26] RECOVERY - nutcracker port on mw1168 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:29:26] RECOVERY - Check systemd state on mw1168 is OK: OK - running: The system is fully operational [11:29:42] 10Operations, 10ops-esams, 10DNS, 10Traffic, 10netops: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#3561654 (10faidon) 05Open>03Resolved [11:30:02] 10Operations, 10ops-eqiad, 10DBA: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3561656 (10jcrespo) [11:30:50] !log restart java daemons on analytics100[1,2] (Hadoop Master nodes) for jvm updates [11:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:48] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1168.eqiad.wmnet [11:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:55] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3561669 (10MoritzMuehlenhoff) [11:44:57] 10Operations, 10Operations-Software-Development, 10HHVM, 10Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#3561670 (10MoritzMuehlenhoff) [11:45:01] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#3561667 (10MoritzMuehlenhoff) 05Open>03Resolved Migration to jessie is completed. 
[11:45:16] 10Operations, 10Operations-Software-Development, 10HHVM, 10Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2571031 (10MoritzMuehlenhoff) [11:45:29] 10Operations, 10Operations-Software-Development, 10HHVM, 10Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2571031 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Migration to jessie is complete [11:47:03] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1168 is OK: OK: synced at Tue 2017-08-29 11:47:01 UTC. [11:47:06] moritzm: great work --^ [11:47:11] yup! [11:47:15] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3561690 (10MoritzMuehlenhoff) All the packaging work for jessie is complete (and the servers in production have been migrated). If deployment-tmh01 is still used it can be re... [11:48:18] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3561691 (10Marostegui) How do you guys want to proceed with this in the end? Is it worth the risk? 
[11:49:27] !log restart kafka daemons on kafka1013 for jvm security updates [11:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:13] PROBLEM - Nginx local proxy to apache on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [11:55:32] PROBLEM - HHVM rendering on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [11:55:32] PROBLEM - Apache HTTP on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [11:55:54] PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [11:56:02] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [11:56:13] RECOVERY - Nginx local proxy to apache on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.082 second response time [11:56:32] RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 74335 bytes in 0.539 second response time [11:56:32] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.036 second response time [11:56:52] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.458 second response time [11:57:03] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.072 second response time [12:00:42] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational [12:03:42] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:15:13] RECOVERY - mediawiki-installation DSH group on mw1168 is OK: OK [12:15:23] (03PS1) 10Elukey: role::analytics::hadoop::master: fix descriptions of HDFS alarms [puppet] - 10https://gerrit.wikimedia.org/r/374536 [12:15:56] (03CR) 10Elukey: [C: 032] role::analytics::hadoop::master: fix descriptions of HDFS alarms [puppet] - 10https://gerrit.wikimedia.org/r/374536 (owner: 10Elukey) [12:25:20] 10Operations, 10HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3561778 (10MoritzMuehlenhoff) [12:25:52] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3561794 (10ovasileva) [12:28:59] 10Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 10Patch-For-Review, and 2 others: Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf - https://phabricator.wikimedia.org/T150874#3561803 (10ovasileva) [12:31:26] !log bounce pybal on lvs1003 to pick up config changes [12:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:34] 10Operations, 10Traffic: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3561810 (10fgiunchedi) [12:37:40] !log restart kafka daemons on kafka1014 for jvm security updates [12:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:34] (03PS3) 10Giuseppe Lavagetto: ganglia::gmetad::rrdcached: fix template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374353 (https://phabricator.wikimedia.org/T171704) [12:40:36] (03PS1) 10Giuseppe Lavagetto: statistics::wmde: fix template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374538 (https://phabricator.wikimedia.org/T171704) [12:43:18] (03PS4) 10Giuseppe Lavagetto: ganglia::gmetad::rrdcached: fix template scoping [puppet] - 
10https://gerrit.wikimedia.org/r/374353 (https://phabricator.wikimedia.org/T171704) [12:44:08] (03CR) 10Giuseppe Lavagetto: [C: 032] ganglia::gmetad::rrdcached: fix template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374353 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [12:46:42] (03PS2) 10Giuseppe Lavagetto: statistics::wmde: fix template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374538 (https://phabricator.wikimedia.org/T171704) [12:55:14] (03CR) 10Giuseppe Lavagetto: [C: 032] statistics::wmde: fix template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374538 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [12:59:59] jouncebot: next [13:00:00] In 0 hour(s) and 0 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1300) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1300). Please do the needful. [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:11] I'm here [13:00:18] I can SWAT today! 
[13:00:24] (03PS4) 10Ottomata: webperf: Convert navtiming.py to use KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [13:00:26] (03PS6) 10Ottomata: webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [13:00:58] Urbanecm: ok, looks like there is only one patch, will merge and deploy [13:01:03] Great [13:01:07] I'll ping you when it's at mwdebug [13:03:21] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:04:53] (03Merged) 10jenkins-bot: Add several HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:06:00] (03CR) 10jenkins-bot: Add several HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:06:25] (03CR) 10Ottomata: [C: 032] webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [13:06:35] Urbanecm: the commit is at mwdebug1002 [13:07:38] ack [13:08:13] zeljkof, please deploy [13:08:30] Urbanecm: deploying [13:08:37] thx [13:09:09] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:374071|Add several HD logos (T150618)]] (duration: 00m 43s) [13:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:22] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [13:10:17] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:374071|Add several HD logos (T150618)]] (duration: 00m 43s) [13:10:23] Urbanecm: deployed [13:10:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:37] thank you [13:12:33] looks like that is all [13:12:37] !log EU SWAT finished [13:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:10] 10Operations, 10Discovery, 10Discovery-Analysis, 10Maps, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3561943 (10Gehel) Deployment is scheduled for Thursday August 31. [13:17:02] PROBLEM - Nginx local proxy to apache on mw1205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response time [13:17:42] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [13:18:01] RECOVERY - Nginx local proxy to apache on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.043 second response time [13:18:02] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:18:21] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:19:02] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 74416 bytes in 0.191 second response time [13:19:11] (03PS1) 10Giuseppe Lavagetto: zuul: remove configfile define [puppet] - 10https://gerrit.wikimedia.org/r/374541 (https://phabricator.wikimedia.org/T171704) [13:19:21] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.072 second response time [13:19:40] weird [13:19:42] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.090 second response time [13:19:51] PROBLEM - Apache HTTP on mw1282 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [13:20:24] (03PS5) 
10Ottomata: webperf: Convert navtiming.py to use KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [13:20:49] (03CR) 10jerkins-bot: [V: 04-1] webperf: Convert navtiming.py to use KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [13:20:51] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.064 second response time [13:21:25] Aug 29 13:18:48 mw1276 systemd[1]: hhvm.service: main process exited, code=killed, status=11/SEGV [13:21:31] no bueno [13:21:43] SWAT was completed not long ago, could be related? [13:22:14] (03PS6) 10Ottomata: webperf: Convert navtiming.py to use KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [13:22:32] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3481107 (10Krenair) Was the maintain-views step not performed? 
```MariaDB [hiwikiversity_p]> show tables; Empty set (0.00 sec)``` [13:24:57] (03CR) 10Ottomata: [C: 032] webperf: Convert navtiming.py to use KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [13:25:25] (03CR) 10Ottomata: [C: 032] "I refactored a little to use hiera lookups from the role class, rather than the modules, and fixed up some parameters that I think would h" [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [13:25:38] (03PS7) 10Ottomata: webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [13:25:40] (03CR) 10Ottomata: [V: 032 C: 032] webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [13:26:10] volans: from the stacktrace it seems something related to AbuseFilter, but I have no idea if it could be related [13:26:11] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:26:33] PROBLEM - Apache HTTP on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:26:57] zeljkof: hi :) [13:27:11] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 74416 bytes in 0.816 second response time [13:27:25] (03PS2) 10Giuseppe Lavagetto: zuul: remove configfile define [puppet] - 10https://gerrit.wikimedia.org/r/374541 (https://phabricator.wikimedia.org/T171704) [13:27:32] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.088 second response time [13:27:47] same stacktrace for --^ [13:28:00] elukey: hi [13:28:15] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Wikidata, 
and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561980 (10Reedy) >>! In T171829#3553943, @Marostegui wrote: > The blocker is fixed and so is this one too: > ``` > mysql:root@localhost [hiwikiversity_p]> show t... [13:29:07] seems that the only merged patch was https://gerrit.wikimedia.org/r/#/c/374071/ in the last SWAT [13:29:21] <_joe_> elukey: did you check the caches sizes? [13:29:25] zeljkof: after the deployment there seems to be some api hosts reporting segfaults for HHVM and PHP stacktraces containing AbuseFilter [13:29:31] _joe_ nope [13:29:39] <_joe_> 1 sec [13:29:58] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561986 (10Marostegui) >>! In T168765#3561969, @Krenair wrote: > Was the maintain-views step not completely performed? > ```MariaDB [hiwikiversity_p]> show tables... [13:30:00] <_joe_> zeljkof: please prepare the revert for any change that could be related to Abusefilter [13:30:04] <_joe_> but do not commit it [13:30:16] zeljkof: is it normal that the keys are repeated? [13:30:31] ty marostegui [13:30:44] I was indeed looking at 1003 [13:31:01] it is fixed now :) [13:31:04] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561990 (10Marostegui) >>! In T168765#3561986, @Marostegui wrote: >>>! In T168765#3561969, @Krenair wrote: >> Was the maintain-views step not completely performed... [13:31:46] many of the lines added in that patch were already there [13:31:58] <_joe_> wat? [13:32:03] <_joe_> let's revert, yes? [13:32:06] https://gerrit.wikimedia.org/r/#/c/374071/3/wmf-config/InitialiseSettings.php [13:33:04] works, ty again [13:33:14] <_joe_> yeah don't think that's relevant [13:33:21] volans, _joe_: oops, I have deployed only changes to logos, did something break? 
[13:33:38] <_joe_> I think it's unrelated [13:33:44] <_joe_> but let me dig a bit deeper [13:33:49] Someone probably wrote a shitty rule and/or enabled it [13:33:55] Got the stack trace anywhere? [13:34:08] it's just duplicated keys in the array, that sounds wrong but unrelated [13:34:16] <_joe_> Reedy: I guess on mw1279 you could have one under /var/core [13:34:17] Reedy: yep, it is in /var/log/hhvm/ (last stacktrace.etc..) [13:34:24] I did not even notice duplicate keys :( my mistake [13:34:27] <_joe_> in /var/log/hhvm, yeah [13:34:32] <_joe_> after it has restarted [13:34:38] I'm around, let me know if I need to do anything [13:34:43] <_joe_> but yeah zeljkof that's surely unrelated [13:34:56] Almost certainly [13:35:03] I think PHP doesn't even complain about it [13:35:11] probably just override them [13:35:19] IIRC, it's the last one that wins [13:35:20] elukey: On what host? [13:35:26] mw1279 [13:36:13] That looks like an array of usernames [13:36:13] one of the last failed ones, mw1279 should be good [13:36:58] slow query: SELECT /* GenderCache::doQuery/ApiQueryAllPages::run */ user_name,up_value FROM `user` LEFT JOIN `user_properties` ON ((user_id = up_user) AND up_property = 'gender') WHERE user_name IN (' [13:37:39] oh, stacktrace [13:38:20] (03PS1) 10Faidon Liambotis: Add sandbox1-esams and ripe-atlas-esams [dns] - 10https://gerrit.wikimedia.org/r/374545 [13:38:31] PROBLEM - Apache HTTP on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:38:45] <_joe_> it's happening again [13:38:58] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3561996 (10Papaul) @elukey i think i will take your advice to burn mw2256 down lol. Here is want I want for you to do for me. Configure the system to generate a kernel crash dump. When the syste... 
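Note on the duplicate-keys exchange above (13:34 to 13:35): the chat's recollection is that a PHP array literal with a repeated key raises no error and the last entry wins. The sketch below is Python, not PHP, and is not what anyone in the channel ran. It is a rough way to spot repeated single-quoted keys in a flat `'key' => 'value'` block before a config patch is merged. The names `find_duplicate_keys`, `effective_values`, and `SAMPLE`, the wiki names, and the logo file names are all hypothetical. The regex is a simplification: it assumes one flat array and would report false positives on nested settings where the same wiki key legitimately appears under several settings.

```python
import re
from collections import Counter

# Matches a single-quoted key followed by `=>`, as in a PHP array literal.
KEY_PATTERN = re.compile(r"'([^']+)'\s*=>")
# Matches a single-quoted key with a single-quoted string value.
PAIR_PATTERN = re.compile(r"'([^']+)'\s*=>\s*'([^']*)'")


def find_duplicate_keys(php_array_body):
    """Return keys that occur more than once, in first-seen order."""
    counts = Counter(KEY_PATTERN.findall(php_array_body))
    return [key for key, n in counts.items() if n > 1]


def effective_values(php_array_body):
    """Mimic last-one-wins: later pairs overwrite earlier ones."""
    return dict(PAIR_PATTERN.findall(php_array_body))


# Hypothetical sample; only the directory name comes from the log.
SAMPLE = """
'examplewiki' => '/static/images/project-logos/examplewiki-1.5x.png',
'otherwiki' => '/static/images/project-logos/otherwiki-1.5x.png',
'examplewiki' => '/static/images/project-logos/examplewiki-2x.png',
"""

if __name__ == "__main__":
    print(find_duplicate_keys(SAMPLE))
    print(effective_values(SAMPLE)["examplewiki"])
```

A check like this only flags candidates for review; whether a repeated key is an actual mistake still needs a human look at the surrounding array.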
[13:39:01] PROBLEM - HHVM rendering on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:39:11] 10Operations, 10Ops-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T174316#3561997 (10Ottomata) @robh, can you advise here? What needs to be done for a LDAP request for someone who has signed an NDA? Is there a place where I can look up the signed NDA to verify? If I do, can I j... [13:39:12] PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [13:39:15] could be that this list of usernames is too long and makes hhvm crash? [13:39:26] <_joe_> which list? [13:39:27] and the user is re-trying and at every retry one HHVM dies? [13:39:31] see error.log [13:39:44] hhvm shouldn't be dying from a slow mysql query... [13:39:45] (03CR) 10Faidon Liambotis: [C: 032] Add sandbox1-esams and ripe-atlas-esams [dns] - 10https://gerrit.wikimedia.org/r/374545 (owner: 10Faidon Liambotis) [13:39:55] Reedy: not for the slowness, but for the size of it [13:40:02] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 74418 bytes in 1.871 second response time [13:40:11] my wild guess [13:40:12] RECOVERY - Nginx local proxy to apache on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.125 second response time [13:40:31] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.021 second response time [13:41:43] AbuseFilter doesn't use the API recent changes... So I'd suggest the two are unrelated [13:41:44] jenkins having issues? [13:42:15] paravoid: not that I know, why? 
[13:42:23] nah, just took a while [13:45:30] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3562003 (10Papaul) @madhuvishy here is what I am about to setup on Labstore2001 3xRAID10 of 8 disks per logical/virtua... [13:50:50] (03PS3) 10Giuseppe Lavagetto: zuul: remove configfile define [puppet] - 10https://gerrit.wikimedia.org/r/374541 (https://phabricator.wikimedia.org/T171704) [13:51:42] (03CR) 10Giuseppe Lavagetto: [C: 032] zuul: remove configfile define [puppet] - 10https://gerrit.wikimedia.org/r/374541 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [13:58:58] (03CR) 10Reedy: "As reported on IRC, a followup of this needs making that fixes the duplicate array keys" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:59:03] (03PS1) 10Faidon Liambotis: Add mgmt IPs for cr3-esams, new MX480 [dns] - 10https://gerrit.wikimedia.org/r/374548 [14:00:09] (03CR) 10Faidon Liambotis: [C: 032] Add mgmt IPs for cr3-esams, new MX480 [dns] - 10https://gerrit.wikimedia.org/r/374548 (owner: 10Faidon Liambotis) [14:06:10] (03PS5) 10Gehel: elasticsearch - switch to using logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373509 [14:08:29] (03PS6) 10Gehel: elasticsearch - switch to using logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373509 [14:09:16] (03CR) 10Gehel: [C: 032] elasticsearch - switch to using logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373509 (owner: 10Gehel) [14:12:58] (03PS3) 10Gehel: apertium - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373510 [14:13:07] (03PS4) 10Gehel: apertium - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373510 [14:13:12] (03CR) 10jerkins-bot: [V: 04-1] apertium - switch to logrotate::rule [puppet] - 
10https://gerrit.wikimedia.org/r/373510 (owner: 10Gehel) [14:15:46] (03PS3) 10Gehel: base - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373515 [14:16:00] (03CR) 10jerkins-bot: [V: 04-1] base - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373515 (owner: 10Gehel) [14:16:12] (03CR) 10Gehel: apertium - switch to logrotate::rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/373510 (owner: 10Gehel) [14:16:24] (03PS4) 10Gehel: base - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373515 [14:17:01] (03CR) 10Gehel: "The convention in puppet is to use snake case, I'd prefer to follow it..." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/373515 (owner: 10Gehel) [14:18:36] (03PS3) 10Gehel: camus - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373516 [14:18:45] (03PS4) 10Gehel: camus - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373516 [14:23:56] (03CR) 10Volans: [C: 031] "LGTM, being the base module please double check it with the puppet compiler to be sure ;)" [puppet] - 10https://gerrit.wikimedia.org/r/373515 (owner: 10Gehel) [14:24:13] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3562108 (10Cmjohnson) moved the fibers to cr1/2 to xe-3/1/7 swapped the cable with fiber connecting pfw-3a and pfw-3b (xe-1/0/17) Moved connections pfw-3a/b and fasw-c1a/... 
[14:32:33] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3562138 (10ovasileva) [14:33:28] !log Shutdown db1055 to replace its BBU - T174265 [14:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:43] T174265: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265 [14:33:45] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2799650 (10ovasileva) @GWicke - timeline now updated in task description. OCG switching will be done by the en... [14:35:40] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3562150 (10Cmjohnson) HP requested the AHS log, uploaded the log to their system. Waiting on their response. Only working with 1006 at the moment. 
[14:47:31] PROBLEM - Host db1055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:48:13] ^ that is expected [14:51:18] !log drop log.MobileWebUIClickTracking_10742159_15423246 from db1047 (archived on HDFS) - T172322 [14:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:31] T172322: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322 [14:52:21] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2074714 [14:52:41] RECOVERY - Host db1055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [14:54:31] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2027261 [14:56:01] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3483663 (10chasemp) >>! In T168765#3561990, @Marostegui wrote: >>>! In T168765#3561986, @Marostegui wrote: >>>>! In T168765#3561969, @Krenair wrote: >>> Was the m... [14:56:32] (03PS3) 10Filippo Giunchedi: install_server: add partman for cassandra JBOD [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) [15:03:03] PROBLEM - Host cp3035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:03:03] PROBLEM - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:03:24] 10Operations, 10ops-eqiad, 10DBA: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265#3562203 (10Marostegui) The BBU has been replaced and looks good: ``` root@db1055:/home/marostegui# megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: BBU Voltage: 3918... [15:03:52] PROBLEM - Host cp3033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:06:28] mark paravoid ^ 3x cp down, expected? 
[15:06:33] esams cp that is [15:06:52] PROBLEM - Host cp3039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:06:56] mark is on-site, but that's not expected no [15:07:00] it is [15:07:15] these are in production, I think? [15:07:23] it's mgmt [15:07:27] oh duh [15:07:32] i unplugged one link to remove a switch [15:07:37] doh, my bad [15:07:38] duh sorry [15:07:44] nice that we can see that these days ;) [15:08:11] RECOVERY - Host cp3035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.72 ms [15:08:12] RECOVERY - Host cp3034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.75 ms [15:08:27] I have to train eyes to read the full hostname now heh [15:09:01] RECOVERY - Host cp3033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.66 ms [15:12:02] RECOVERY - Host cp3039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.41 ms [15:12:29] 10Operations, 10Ops-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T174316#3562234 (10Samtar) If at all helpful, the NDA sent through "cobblestone" was named "WMF-Sam Tarling Volunteer NDA January 2017" and has a reference number `832b7db5-a494-43ac-8895-c3ba6837e18d` [15:14:57] (03PS1) 10Giuseppe Lavagetto: puppetmaster::passenger: fix template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374553 (https://phabricator.wikimedia.org/T171704) [15:14:57] (03PS1) 10Giuseppe Lavagetto: labspuppetmaster: fix array interpolation in strings [puppet] - 10https://gerrit.wikimedia.org/r/374554 (https://phabricator.wikimedia.org/T171704) [15:17:11] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3562267 (10madhuvishy) @Papaul Yup that's perfect, thanks! 
[15:20:29] 10Operations, 10LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T174316#3562274 (10RobH) 05Open>03Resolved a:03RobH So Ldap requests are not the same as ops requests, a few things has to happen: * swapping this from #ops-access-requests to #ldap-access-requests - done *... [15:20:49] 10Operations, 10LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T174316#3562278 (10RobH) a:05RobH>03None [15:24:14] (03CR) 10Eevans: "> > So this is ~50G for the data raid-1? If so, that seems to be" [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [15:31:37] (03PS19) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [15:32:37] (03PS1) 10Alexandros Kosiaris: kubelet: Remove configure-cbr0 parameter [puppet] - 10https://gerrit.wikimedia.org/r/374556 (https://phabricator.wikimedia.org/T170119) [15:33:42] (03PS2) 10Alexandros Kosiaris: kubelet: Remove configure-cbr0 parameter [puppet] - 10https://gerrit.wikimedia.org/r/374556 (https://phabricator.wikimedia.org/T170119) [15:35:24] (03CR) 10Alexandros Kosiaris: "Added WMCS people so they are aware this is going out. I expect it to not cause any problems as it is a tested noop in toollabs. 
I am prob" [puppet] - 10https://gerrit.wikimedia.org/r/374556 (https://phabricator.wikimedia.org/T170119) (owner: 10Alexandros Kosiaris) [15:48:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3562348 (10ayounsi) [15:48:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3554735 (10ayounsi) [15:49:31] (03CR) 10Hashar: "Booted a new instance and it works:" [puppet] - 10https://gerrit.wikimedia.org/r/369873 (owner: 10Hashar) [15:54:07] !log T169939: Decommission Cassandra: restbase2005-c.codfw.wmnet [15:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:19] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [15:54:31] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 422 [15:59:45] 10Operations, 10ops-eqiad, 10Release-Engineering-Team: tin has a failing hdd - https://phabricator.wikimedia.org/T174449#3562401 (10RobH) [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1600). 
[16:00:58] seems no patches scheduled [16:08:21] (03PS1) 10Jcrespo: mariadb: Implement regular logical backups using mydumper [puppet] - 10https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) [16:08:46] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Implement regular logical backups using mydumper [puppet] - 10https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [16:12:39] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562459 (10chasemp) a:03madhuvishy [16:13:11] 10Operations, 10Release-Engineering-Team, 10hardware-requests: eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3562461 (10RobH) [16:14:06] (03PS2) 10Jcrespo: mariadb: Implement regular logical backups using mydumper [puppet] - 10https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) [16:14:48] (03Abandoned) 10Jcrespo: [WIP]mariadb: First attempt at a mydumper-based dump script [puppet] - 10https://gerrit.wikimedia.org/r/371944 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [16:16:23] (03CR) 10Jcrespo: "The script is mostly empty, but this would be a skeleton of the puppet code, with minimal run of mydumper (but not much functionality)." [puppet] - 10https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [16:28:42] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:35:01] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562523 (10madhuvishy) @Marostegui We talked about this today in our meeting, and think that since we don't have significant user traffic moved over from 1001... 
[16:38:09] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562543 (10bd808) I vote we close this as "resolved" with a note that 1001/3 have not been rebooted because of the fear of catastrophic hardware failure and t... [16:38:15] bd808 so... do you know how I would get access to non _p databases? https://phabricator.wikimedia.org/T170717#3560460 [16:39:11] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [16:39:13] davidwbarratt: what db server are you trying to do this on? [16:39:39] bd808 db1095 [16:41:52] PROBLEM - Host labstore2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:42:44] hmmm.... db1095 is a sanatarium host. I don't think that's where you should be going for raw data. [16:43:17] Just run `sql enwiki` from terbium and you should be connected to a full replica server [16:43:38] bd808 ah! thanks! [16:44:09] bd808 and that worked! thanks! [16:44:20] yw! [16:45:08] the sanatarium boxes are special. they filter out the private data that can't go into the cloud services replicas [16:48:11] bd808 ah. got it. thanks! [16:51:46] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3562604 (10Papaul) @madhuvishy done [16:53:10] (03CR) 10Smalyshev: [C: 031] wdqs - logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - 10https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [16:53:21] RECOVERY - Host labstore2002 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [16:54:12] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562639 (10Marostegui) >>! 
In T168584#3562523, @madhuvishy wrote: > @Marostegui We talked about this today in our meeting, and think that since we don't have... [16:57:27] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562647 (10RobH) [16:57:56] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3562695 (10EBernhardson) >>! In T172710#3561315, @Gehel wrote: > Logs are now sent to logstash, but the "host" field isn't set correctly (its value i... [16:59:56] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3562704 (10Gehel) Oh, I was expecting %{HOSTNAME} to be interpreted by logstash itself, not as a ref in the same document. There is something about H... [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1700). [17:00:17] Nothing for ORES! [17:10:07] 10Operations, 10Release-Engineering-Team, 10hardware-requests: eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3562755 (10RobH) Please note that there has been some IRC discussion. The relevance of moving deployment to ganeti was discussed on T144578. Additionally, @bd8... [17:12:09] (03CR) 10GWicke: "I am a bit concerned about doubling the write bandwidth consumed by journal writes. 
IIRC Cassandra journals contain full data, so this is " [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [17:13:47] 10Operations, 10Release-Engineering-Team, 10hardware-requests: eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3562461 (10demon) >>! In T174452#3562755, @RobH wrote: > I'd advise we try out the spare system with the SATA disks and see how well it works. Its a higher perf... [17:20:00] !log Disabled oathauth for KartikMistry on wikitech [17:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:43] 10Operations, 10Traffic: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3561810 (10BBlack) Are the non-icmp graphs somehow LVS-specific? My past impression of such graphs is that they aren't, and it just happens to be the case that the bulk of the LVS hos... [17:21:17] (03PS1) 10BBlack: browsersec: update es translation [puppet] - 10https://gerrit.wikimedia.org/r/374584 [17:21:19] (03PS1) 10BBlack: browsersec: update ar translation [puppet] - 10https://gerrit.wikimedia.org/r/374585 [17:26:21] PROBLEM - MegaRAID on db1048 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [17:26:52] PROBLEM - Host mw1228 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:45] (03PS1) 10Chad: group0 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374587 [17:29:27] 10Operations, 10cloud-services-team: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#3562862 (10madhuvishy) [17:29:31] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3562860 (10madhuvishy) 05Open>03Resolved 1001/3 have not been rebooted because of the fear of catastrophic hardware failure and their impending decomm. 
[17:29:38] (03CR) 10Chad: [C: 04-2] "not 4 now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374587 (owner: 10Chad) [17:30:09] !log demon@tin Started scap: bootstrap wmf.16 [17:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:26] 10Operations, 10ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3562868 (10Cmjohnson) The disks have been swapped needs re-install [17:30:38] 10Operations, 10Analytics: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (10Ottomata) [17:32:02] RECOVERY - Host mw1228 is UP: PING WARNING - Packet loss = 44%, RTA = 0.28 ms [17:32:32] (03CR) 10BBlack: [C: 032] browsersec: update es translation [puppet] - 10https://gerrit.wikimedia.org/r/374584 (owner: 10BBlack) [17:33:10] I’d like to run lsof on scb1002 to troubleshoot an ORES issue. Is there any way to get this sudo permission temporarily, or can someone with root paste me the results? <— akosiaris [17:35:41] (03PS1) 10Reedy: Add techconductwiki to wgCanonicalServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374589 (https://phabricator.wikimedia.org/T174447) [17:35:55] jouncebot: next [17:35:55] In 1 hour(s) and 24 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1900) [17:35:57] jouncebot: now [17:35:57] For the next 0 hour(s) and 24 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1700) [17:36:05] (03CR) 10Reedy: [C: 032] Add techconductwiki to wgCanonicalServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374589 (https://phabricator.wikimedia.org/T174447) (owner: 10Reedy) [17:36:28] awight, why lsof on scb1002? 
[17:36:37] We've been running capacity tests on ores100* [17:37:25] I was trying to get a sense of FD consumption during regular load [17:37:44] You might be right that looking at the stressed machines would be more interesting, though [17:37:48] (03Merged) 10jenkins-bot: Add techconductwiki to wgCanonicalServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374589 (https://phabricator.wikimedia.org/T174447) (owner: 10Reedy) [17:37:56] cool—and if we have root, all the better. [17:37:58] (03CR) 10jenkins-bot: Add techconductwiki to wgCanonicalServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374589 (https://phabricator.wikimedia.org/T174447) (owner: 10Reedy) [17:38:53] halfak: I have no root there. Mind if I request? [17:39:11] +1 [17:39:32] root might be overkill but we should be able to add it to the wheel stuff. [17:40:19] lol [17:40:27] * halfak can't view the sudoers file on scb1002 [17:40:52] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562940 (10RobH) [17:45:18] (03PS2) 10BBlack: browsersec: update ar translation [puppet] - 10https://gerrit.wikimedia.org/r/374585 (https://phabricator.wikimedia.org/T163251) [17:45:57] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562979 (10RobH) So I realized that: d-i partman-auto/choose_recipe es was in the recipe, and isn't needed since it doe... 
[17:46:31] (03CR) 10BBlack: [C: 032] browsersec: update ar translation [puppet] - 10https://gerrit.wikimedia.org/r/374585 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [17:47:25] (03PS1) 10Awight: Let scoring platform team run "lsof" for diagnostics [puppet] - 10https://gerrit.wikimedia.org/r/374593 (https://phabricator.wikimedia.org/T174402) [17:47:39] akosiaris: ^ I would love to have that permission [17:50:43] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3563013 (10leila) >>! In T163251#3561653, @Johan wrote: >... [17:52:07] 10Operations, 10Cloud-Services: Register to Wikitech - https://phabricator.wikimedia.org/T174469#3563019 (10Vacio) [17:57:15] 10Operations, 10ORES, 10Scoring-platform-team-Backlog, 10Patch-For-Review, 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3563049 (10awight) #operations I would like lsof permissions on the ORES boxes, https://gerrit.wiki... [17:57:18] 10Operations, 10Cloud-Services: Register to Wikitech - https://phabricator.wikimedia.org/T174469#3563019 (10madhuvishy) @Vacio Could you please elaborate on what the problem is? Did you try signing up to wikitech and did you run into an error? If so what? You can create a wikitech account at https://wikitech.w... [17:59:07] (03CR) 10Hoo man: [C: 04-1] "-1 for now, until concerns about the link item client widget have been looked into (see T174345#3558741)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/374328 (https://phabricator.wikimedia.org/T174345) (owner: 10Urbanecm) [17:59:24] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3563103 (10Eevans) [18:11:53] 10Operations, 10Dumps-Generation, 10Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3563181 (10madhuvishy) @ArielGlenn Sounds good, I would push towards a larger window of atleast 2 hours - 45 minutes to an hour for 3 rsyncs + some cleanup se... [18:15:29] !log T169939: Decommission Cassandra: restbase1008-a.eqiad.wmnet [18:15:36] !log demon@tin Finished scap: bootstrap wmf.16 (duration: 45m 27s) [18:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:42] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [18:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:48] 10Operations, 10Release-Engineering-Team, 10Category, 10Epic, 10Services (watching): FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453#3563197 (10JAufrecht) Adding an extra tag to support Phlogiston reporting experiments. 
[18:17:41] !log depooling restbase1008,1009,1010,2003,2005 while cluster reshaping is going on [18:17:48] (03PS2) 10Urbanecm: Restrict merging rights to autoconfirmed users on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374328 (https://phabricator.wikimedia.org/T174345) [18:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:01] (03PS4) 10Urbanecm: Enable SandboxLink on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054) [18:19:44] !log attempting firmware upgrade on scs-a8-eqiad [18:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:57] (03PS1) 10BBlack: Varnish: reload VCL on error page changes [puppet] - 10https://gerrit.wikimedia.org/r/374601 [18:21:59] (03PS1) 10BBlack: browsersec: add fa translation [puppet] - 10https://gerrit.wikimedia.org/r/374602 (https://phabricator.wikimedia.org/T163251) [18:25:23] (03CR) 10BBlack: [C: 032] Varnish: reload VCL on error page changes [puppet] - 10https://gerrit.wikimedia.org/r/374601 (owner: 10BBlack) [18:25:27] (03CR) 10BBlack: [C: 032] browsersec: add fa translation [puppet] - 10https://gerrit.wikimedia.org/r/374602 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [18:27:06] 10Operations, 10DC-Ops, 10netops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563233 (10RobH) [18:29:15] (03PS1) 10BBlack: browsersec: add missing dir=rtl for fa [puppet] - 10https://gerrit.wikimedia.org/r/374604 (https://phabricator.wikimedia.org/T163251) [18:29:55] (03CR) 10BBlack: [C: 032] browsersec: add missing dir=rtl for fa [puppet] - 10https://gerrit.wikimedia.org/r/374604 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [18:30:36] 10Operations, 10DC-Ops, 10netops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563264 (10RobH) This was triggered by an error that both @ayounsi and I experienced 
attempting to connect to https://scs-oe11-esams.mgmt.esams.wmnet/ On FF and Chrome, it gives the error: ``` Sec... [18:32:13] 10Operations, 10DC-Ops, 10netops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563267 (10RobH) [18:36:21] RECOVERY - MegaRAID on db1048 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [18:36:39] godog: if I understand https://phabricator.wikimedia.org/T169860 correctly, new tests should be defined within the prometheus module itself? (This contrasted with icinga where the tests are defined for the host-to-be-tested rather than on the host-that-runs-the-tests) [18:48:07] 10Operations, 10DC-Ops, 10netops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563323 (10RobH) [18:48:27] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Patch-For-Review: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3563325 (10Sharvaniharan) 05Resolved>03Open @herron I accidentally overwrote my ssh public key . I am so sorry . Attached is my new... [18:53:58] 10Operations, 10DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563341 (10RobH) [18:58:37] (03PS1) 10BBlack: browsersec: re-order languages slightly [puppet] - 10https://gerrit.wikimedia.org/r/374605 (https://phabricator.wikimedia.org/T163251) [19:00:05] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T1900). 
[19:00:34] (03PS1) 10ArielGlenn: copy of completed dump files plus metadata from dumpsdata to web server [puppet] - 10https://gerrit.wikimedia.org/r/374606 (https://phabricator.wikimedia.org/T169849) [19:00:36] (03CR) 10BBlack: [C: 032] browsersec: re-order languages slightly [puppet] - 10https://gerrit.wikimedia.org/r/374605 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [19:00:50] (03CR) 10jerkins-bot: [V: 04-1] copy of completed dump files plus metadata from dumpsdata to web server [puppet] - 10https://gerrit.wikimedia.org/r/374606 (https://phabricator.wikimedia.org/T169849) (owner: 10ArielGlenn) [19:01:22] !log ppchelko@tin Started deploy [restbase/deploy@7f2e55f]: Update CXServer endpoints config [19:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:36] !log restarting main-eqiad -> analytics kafka mirror maker processes on analytics kafka brokers, something is not working... [19:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:09] 10Operations, 10Dumps-Generation, 10Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3563386 (10ArielGlenn) I'm hoping to avoid the bwlimit option, I use this in our current setup but it's a hard cap even when there's not use of the interface... 
[19:08:11] !log ppchelko@tin Finished deploy [restbase/deploy@7f2e55f]: Update CXServer endpoints config (duration: 06m 48s) [19:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:37] (03CR) 10Chad: [C: 032] group0 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374587 (owner: 10Chad) [19:09:51] (03PS1) 10RobH: new ssh pub key for Sharvani Haran [puppet] - 10https://gerrit.wikimedia.org/r/374607 (https://phabricator.wikimedia.org/T173886) [19:10:10] (03Merged) 10jenkins-bot: group0 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374587 (owner: 10Chad) [19:10:21] (03CR) 10jenkins-bot: group0 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374587 (owner: 10Chad) [19:10:26] ottomata: you are clinic this week right? [19:10:35] you forgot to update topic in here, i shall now =] [19:12:59] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.16 [19:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:17] (03CR) 10Sharvaniharan: [C: 031] new ssh pub key for Sharvani Haran [puppet] - 10https://gerrit.wikimedia.org/r/374607 (https://phabricator.wikimedia.org/T173886) (owner: 10RobH) [19:14:32] (03CR) 10RobH: [C: 032] new ssh pub key for Sharvani Haran [puppet] - 10https://gerrit.wikimedia.org/r/374607 (https://phabricator.wikimedia.org/T173886) (owner: 10RobH) [19:16:50] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:17:56] haha, i did, thanks robh [19:18:00] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Patch-For-Review: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3563500 (10RobH) 05Open>03Resolved fixed and new pubkey is live. change was confirmed as valid via user update into phab, as well... 
[19:18:01] (03PS1) 10Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - 10https://gerrit.wikimedia.org/r/374610 [19:18:05] no worries [19:19:18] (03PS2) 10Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - 10https://gerrit.wikimedia.org/r/374610 [19:19:49] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Patch-For-Review: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3563513 (10Sharvaniharan) Thank you so much @RobH . You are a lifesaver! [19:21:00] (03CR) 10Ppchelko: [C: 031] Synchronize message.max.bytes between all kafka clusters and producers [puppet] - 10https://gerrit.wikimedia.org/r/374610 (owner: 10Ottomata) [19:21:59] Pchelolo: i think i also need to set max.request.size in the mirror maker producer configs [19:22:01] 10Operations, 10hardware-requests, 10Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3563524 (10greg) [19:22:04] 10Operations, 10ops-eqiad, 10Release-Engineering-Team (Watching / External): tin has a failing hdd - https://phabricator.wikimedia.org/T174449#3563527 (10greg) [19:26:39] (03PS3) 10Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - 10https://gerrit.wikimedia.org/r/374610 [19:28:26] (03PS4) 10Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - 10https://gerrit.wikimedia.org/r/374610 [19:29:20] (03PS5) 10Ottomata: Synchronize message.max.bytes between all kafka clusters and producers [puppet] - 10https://gerrit.wikimedia.org/r/374610 [19:29:31] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:31:27] (03CR) 10Ottomata: "Ok, looks good: https://puppet-compiler.wmflabs.org/compiler02/7650/" [puppet] - 10https://gerrit.wikimedia.org/r/374610 (owner: 
10Ottomata) [19:31:29] (03CR) 10Ottomata: [C: 032] Synchronize message.max.bytes between all kafka clusters and producers [puppet] - 10https://gerrit.wikimedia.org/r/374610 (owner: 10Ottomata) [19:34:16] !log restarting all kafka brokers and mirror maker processes to apply https://gerrit.wikimedia.org/r/#/c/374610/ [19:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:38] (03PS1) 10Cmjohnson: Removing mgmt dns entries for mw1170-1179 T167130 [dns] - 10https://gerrit.wikimedia.org/r/374614 [19:36:42] ottomata: Do you have a minute for https://gerrit.wikimedia.org/r/#/c/374593/ ? [19:36:45] (03PS2) 10Cmjohnson: Removing mgmt dns entries for mw1170-1179 T167130 [dns] - 10https://gerrit.wikimedia.org/r/374614 [19:36:57] awight: in the middle of restarting stuff to fix broken eventstreams [19:37:00] will look in a bit [19:37:06] thanks! [19:37:38] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for mw1170-1179 T167130 [dns] - 10https://gerrit.wikimedia.org/r/374614 (owner: 10Cmjohnson) [19:38:11] PROBLEM - Check systemd state on mw1259 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:41:53] ^ mw1259 fixed [19:42:04] remnant of reimage process [19:42:11] RECOVERY - Check systemd state on mw1259 is OK: OK - running: The system is fully operational [19:42:20] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[jobchron] [19:44:21] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [19:47:35] !log T169939: Decommission Cassandra: restbase1008-b.eqiad.wmnet [19:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:46] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [19:50:51] PROBLEM - cassandra-a service on restbase2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:51:00] PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused [19:51:10] PROBLEM - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:51:20] PROBLEM - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:51:27] looking ^^^ [19:53:05] ACKNOWLEDGEMENT - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
eevans Decommissioned host [19:53:05] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused eevans Decommissioned host [19:53:05] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned host [19:53:05] ACKNOWLEDGEMENT - cassandra-a service on restbase2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans Decommissioned host [19:55:49] (03PS3) 10Ppchelko: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [19:56:31] (03PS1) 10Ottomata: Set fetch.message.max.bytes for mirror maker consumers [puppet] - 10https://gerrit.wikimedia.org/r/374619 [19:57:18] (03CR) 10jerkins-bot: [V: 04-1] JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [19:58:39] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/7651/" [puppet] - 10https://gerrit.wikimedia.org/r/374619 (owner: 10Ottomata) [19:58:42] (03CR) 10Ottomata: [C: 032] Set fetch.message.max.bytes for mirror maker consumers [puppet] - 10https://gerrit.wikimedia.org/r/374619 (owner: 10Ottomata) [19:58:48] (03PS4) 10Ppchelko: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [20:12:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3563777 (10ayounsi) [20:13:19] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3563783 (10Cmjohnson) mw1307-1328 are racked, idrac setup, mgmt dns and switch ports configured. [x] receive in system on procurement task T159963 [x] bios/drac/serial setup/test... 
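Editorial aside on the two Kafka patches above ("Synchronize message.max.bytes ..." and "Set fetch.message.max.bytes for mirror maker consumers"): both deal with keeping size limits compatible across producers, brokers and consumers. The sketch below is a rough illustration, not taken from the puppet changes. It assumes the usual rule of thumb that a producer's `max.request.size` should not exceed the broker's `message.max.bytes`, and that the old consumer's `fetch.message.max.bytes` should be at least as large as the broker limit. The function name `check_size_limits` and the byte values are hypothetical.

```python
def check_size_limits(producer_max_request, broker_message_max, consumer_fetch_max):
    """Return a list of problems; an empty list means the limits are compatible."""
    problems = []
    # A producer allowed to send more than the broker accepts will see rejections.
    if producer_max_request > broker_message_max:
        problems.append("producer max.request.size exceeds broker message.max.bytes")
    # A consumer that fetches less than the broker accepts can stall on a large message.
    if consumer_fetch_max < broker_message_max:
        problems.append("consumer fetch.message.max.bytes is below broker message.max.bytes")
    return problems


# Hypothetical values in bytes, for illustration only.
print(check_size_limits(4194304, 4194304, 4194304))
print(check_size_limits(4194304, 4194304, 1048576))
```

The second call models a mirror maker consumer left at a smaller fetch size than the brokers accept, which is the kind of mismatch the second patch appears to address.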
[20:14:08] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3563786 (10Cmjohnson) [20:22:31] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:10] PROBLEM - trendingedits endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:21] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [20:23:41] (03CR) 10Eevans: "> I am a bit concerned about doubling the write bandwidth consumed by" [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [20:24:01] RECOVERY - trendingedits endpoints health on scb1001 is OK: All endpoints are healthy [20:28:59] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3563854 (10RobH) Ok, so putting the recipe info to ignore noswap requires: partman-basicfilesystems partman-basicfilesystems/no_... [20:29:02] (03PS2) 10Rush: openstack: remove legacy firewall rules for controller [puppet] - 10https://gerrit.wikimedia.org/r/374424 (https://phabricator.wikimedia.org/T171494) [20:29:05] (03PS2) 10Rush: openstack: remove redis replication rule [puppet] - 10https://gerrit.wikimedia.org/r/374427 (https://phabricator.wikimedia.org/T171494) [20:29:58] 10Operations, 10Mail: mail.wikimedia.org SSL cert expiring Mon 23 Oct 2017 - https://phabricator.wikimedia.org/T174081#3563870 (10herron) Looking into how to renew this using let's encrypt. The globalsign cert used today is configured with attributes: CN=mail.wikimedia.org SAN=cert mail.wikimedia.org, m... 
[20:30:40] (03CR) 10Rush: [C: 032] openstack: remove legacy firewall rules for controller [puppet] - 10https://gerrit.wikimedia.org/r/374424 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:31:57] !log ppchelko@tin Started deploy [changeprop/deploy@ed0fadc]: Release a redis-based deduplicator in test mode [20:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:40] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:33:40] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational [20:33:48] (03CR) 10Rush: [C: 032] openstack: remove redis replication rule [puppet] - 10https://gerrit.wikimedia.org/r/374427 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:36:26] !log ppchelko@tin Finished deploy [changeprop/deploy@ed0fadc]: Release a redis-based deduplicator in test mode (duration: 04m 28s) [20:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:21] !log ppchelko@tin Started deploy [changeprop/deploy@a57c79d]: Release a redis-based deduplicator in test mode. Attempt 2 [20:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:38] !log ppchelko@tin Finished deploy [changeprop/deploy@a57c79d]: Release a redis-based deduplicator in test mode. 
Attempt 2 (duration: 01m 17s) [20:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:45] !log bounce varnish on cp1074 / cp1099 / cp1072 - mailbox lag [20:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:01] (03PS2) 10Ayounsi: Icinga: Add basic monitoring for routers' active RE [puppet] - 10https://gerrit.wikimedia.org/r/374435 (https://phabricator.wikimedia.org/T174397) [20:52:30] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [20:54:50] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:54:52] (03CR) 10Ayounsi: [C: 032] Icinga: Add basic monitoring for routers' active RE [puppet] - 10https://gerrit.wikimedia.org/r/374435 (https://phabricator.wikimedia.org/T174397) (owner: 10Ayounsi) [20:54:58] (03PS3) 10Ayounsi: Icinga: Add basic monitoring for routers' active RE [puppet] - 10https://gerrit.wikimedia.org/r/374435 (https://phabricator.wikimedia.org/T174397) [20:56:10] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [20:57:49] (03CR) 10Eevans: "> > I am a bit concerned about doubling the write bandwidth consumed" [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [21:05:09] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:05:29] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:05:29] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [21:06:32] andrewbogott: fullstack kicked a few instances in a row^ [21:06:41] dang [21:06:46] seems weird puppet run related, it's doing that same thing w/ not being able to parse the puppet run output [21:07:11] File 
"/usr/local/sbin/nova-fullstack", line 562, in [21:07:11] main() [21:07:11] File "/usr/local/sbin/nova-fullstack", line 536, in main [21:07:13] for k, v in puppetrun[d].iteritems(): [21:07:15] KeyError: 'changes' [21:07:45] andrewbogott: I did merge those 2 rule cleanups although i can't for the life of figure out how it would be related fyi [21:08:58] chasemp: where did you get that paste? Is it in the boot log on horizon? [21:09:12] andrewbogott: from /var/log/upstart/nova-fullstack.log [21:09:17] on labnet1001 [21:10:38] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [21:10:47] seems like the VMs are actually broken, so it's not entirely the test's fault... [21:11:01] that recovery is probably me running a manual test [21:13:02] I think it's dns that's failing [21:13:05] andrewbogott: my manual test says [21:13:06] 2017-08-29 21:12:35,328 DEBUG sudo: unable to resolve host manual-fullstack-1504041011 [21:13:06] and the instance is actually fine... [21:14:32] andrewbogott: yeah, so far DNS works for me generally for existing things but maybe isn't working for new instances? [21:15:29] lots of timeouts in the designate-sink log [21:15:38] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [21:16:19] !log restarting designate-sink on labservices1001 [21:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:45] andrewbogott: are labservices hosts ssh'ing to labcontrol? 
[21:18:04] that could have been mysteriously allowed via https://gerrit.wikimedia.org/r/#/c/374424/2 previously and now timing out [21:18:09] oh [21:18:16] yes, if there's not an explicit rule for that there should be [21:18:18] PROBLEM - MariaDB Slave Lag: s4 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 409.45 seconds [21:18:23] in order to clean up salt certificates [21:18:24] let me look [21:19:07] (03PS1) 10Rush: Revert "openstack: remove legacy firewall rules for controller" [puppet] - 10https://gerrit.wikimedia.org/r/374641 [21:19:13] (03PS2) 10Rush: Revert "openstack: remove legacy firewall rules for controller" [puppet] - 10https://gerrit.wikimedia.org/r/374641 [21:19:49] (03CR) 10Rush: [C: 032] Revert "openstack: remove legacy firewall rules for controller" [puppet] - 10https://gerrit.wikimedia.org/r/374641 (owner: 10Rush) [21:20:09] andrewbogott: I'm going to revert that cleanup to work on explicit rules for the moment [21:20:17] ok, thanks [21:20:47] My guess is that this is partly due to the fact that I separated the salt and puppetmasters. So there's an explicit rule but that moved to the new puppetmasters… leaving the salt master untended. [21:21:16] ah, yeah it's also what makes overly broad and permissive rules difficult to reason about [21:21:57] so we want something like [21:21:59] https://www.irccloud.com/pastebin/12xtnM5b/ [21:22:03] andrewbogott: I'm merging and applying, could you clear out the nova-fullstack backlog?
[21:22:05] on labcontrol [21:22:08] yep [21:23:18] RECOVERY - MariaDB Slave Lag: s4 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [21:24:39] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [21:25:45] watching a test run through now [21:25:55] dns looks better [21:26:06] I restarted fullstack probably right after you did so we might leak one of those two that are running now [21:26:40] heh kk [21:26:45] should have coordinated [21:27:52] ok if I delete fullstackd-1504041863 now so it's actually clean? [21:28:03] andrewbogott: so seems to be working, and somehow DNS failures bubble up to be a funky puppet error [21:28:05] andrewbogott: yup [21:28:20] but it's kind of cool that this issue surfaced in short order in an expected place [21:28:33] yeah, better than finding out about it tomorrow from a user [21:28:46] Do you want to write the proper firewall patch or shall I? [21:29:09] andrewbogott: go for it, can you think of anything else that ssh's to labcontrol automation wise? [21:29:33] ... [21:29:36] https://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%92%D0%BA%D0%BB%D0%B0%D0%B4/128.0.0.0/2 <-- this block is not right [21:29:42] a /2 block?! [21:29:50] what is that? a zillion of IPs?
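Editorial aside on the /2 question above: the size and bounds of an IPv4 CIDR block can be checked with Python's standard `ipaddress` module. This is a minimal sketch and not something that was run in the channel; the helper name `describe_block` is hypothetical.

```python
import ipaddress


def describe_block(cidr):
    """Return (first address, last address, address count) for a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return str(net.network_address), str(net.broadcast_address), net.num_addresses


first, last, count = describe_block("128.0.0.0/2")
# A /2 leaves 32 - 2 = 30 host bits, so the block holds 2**30 addresses.
print(first, last, count)
```

A /2 covers a quarter of the IPv4 address space, which is why the listed block looked alarming.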
[21:32:15] tabbycat something like 1073741824 [21:33:08] !log T169939: Decommission Cassandra: restbase1008-c.eqiad.wmnet [21:33:15] tabbycat, relevant: https://labs.ripe.net/Members/emileaben/the-curious-case-of-128.0-16 [21:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:22] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [21:33:24] a billion of IPs, nah, doesn't look too bad :P [21:35:56] (03PS1) 10Andrew Bogott: openstack: refine firewall rules for controller [puppet] - 10https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494) [21:36:17] (03CR) 10jerkins-bot: [V: 04-1] openstack: refine firewall rules for controller [puppet] - 10https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494) (owner: 10Andrew Bogott) [21:36:24] chasemp: I think ^ is right but let's wait until tomorrow to merge [21:37:22] andrewbogott: kk but modules/role/manifests/labs/openstack/nova/controller.pp:45 ERROR single quoted string containing a variable found (single_quote_string_with_variables) [21:37:35] yeah, fixed [21:37:36] (03PS2) 10Andrew Bogott: openstack: refine firewall rules for controller [puppet] - 10https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494) [21:37:39] sweet [21:38:07] (03CR) 10Rush: [C: 031] "nice, tomorrow we march" [puppet] - 10https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494) (owner: 10Andrew Bogott) [21:45:19] chasemp: sorry about all the landmines in that openstack code! It suffers from 5 years of incrementalism [21:45:55] andrewbogott: no worries, I appreciate you helping me walk back through it [21:46:58] tabbycat: Seems to be some kind of display issue. 128./2 would be 128.0.0.0-191.255.255.255, but I'm fine editing with a 134.2. address (logged out). 
So I guess it's not really the case that they're blocking one fourth of all ip addresses ;) Mediawiki doesn't allow blocking too big ranges anyways afaik. [21:47:44] most likely... I know mw does not allow rangeblocks larger than /16 on IPv4 [21:48:11] I was doing some CU stuff and saw that and was like... hello?! [21:48:50] Yeah, I thought the same when reading this, that's why I tried out with some IP from that range. :D [21:54:03] (03PS1) 10RobH: further tweaks to kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/374645 (https://phabricator.wikimedia.org/T174457) [22:23:12] (03PS1) 10Rush: prometheus: allow setting a specific listening address and port [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) [22:55:02] (03PS1) 10MaxSem: Migrate AbuseFilter config off wmg variables, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374651 [22:55:04] (03PS1) 10MaxSem: Migrate AbuseFilter config off wmg variables, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374652 [22:55:06] (03PS1) 10MaxSem: Move a variable closer to other relevant code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374653 [22:56:36] (03CR) 10jerkins-bot: [V: 04-1] Migrate AbuseFilter config off wmg variables, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374651 (owner: 10MaxSem) [22:57:59] (03PS1) 10EBernhardson: Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) [22:59:01] (03CR) 10jerkins-bot: [V: 04-1] Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: 10EBernhardson) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come.
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170829T2300). [23:00:05] bmansurov: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:59] here [23:01:29] I can swat it [23:02:05] (03PS2) 10EBernhardson: Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) [23:03:45] (03CR) 10jerkins-bot: [V: 04-1] Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: 10EBernhardson) [23:08:29] (03PS3) 10GWicke: Enable JobQueueEventBus on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374399 (owner: 10Ppchelko) [23:17:51] RainbowSprinkles, thanks for swatting, where can I see the change? Or is it not live yet? [23:29:39] (03PS2) 10MaxSem: Migrate AbuseFilter config off wmg variables, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374651 [23:29:41] (03PS2) 10MaxSem: Migrate AbuseFilter config off wmg variables, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374652 [23:29:43] (03PS2) 10MaxSem: Move a variable closer to other relevant code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374653 [23:29:47] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3564554 (10Cmjohnson) Spoke with HP support, not very helpful. They will not send anyone to help unless we want to pay for it. Going to try and talk... [23:32:01] bmansurov: I'm so sorry, got distracted going down a rabbit hole. 
It's merged, one sec and I'll sync it live [23:32:16] thanks [23:33:40] !log demon@tin Synchronized php-1.30.0-wmf.16/extensions/Popups/: swat (duration: 00m 50s) [23:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:55] Wait, wrong directory [23:33:56] lol [23:34:42] Lol [23:34:49] !log demon@tin Synchronized php-1.30.0-wmf.15/extensions/Popups/: swat (duration: 00m 47s) [23:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:40] Ok there we go [23:36:11] I see the change at 1002 [23:36:33] It's live everywhere :) [23:36:35] RainbowSprinkles, thanks! [23:47:08] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3564575 (10Eevans) [23:49:58] PROBLEM - Host mw1228 is DOWN: PING CRITICAL - Packet loss = 100% [23:50:15] !log reinstalling mw1228 [23:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:18] RECOVERY - Host mw1228 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms