[01:04:17] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509498249 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3826336 keys, up 4 minutes 6 seconds - replication_delay is 1509498249
[01:04:17] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[01:04:17] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[01:04:27] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[01:04:38] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509498275 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3824473 keys, up 4 minutes 31 seconds - replication_delay is 1509498275
[01:05:08] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509498303 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3829264 keys, up 5 minutes - replication_delay is 1509498303
[01:05:17] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3817053 keys, up 5 minutes 7 seconds - replication_delay is 0
[01:05:18] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8523026 keys, up 5 minutes 12 seconds - replication_delay is 0
[01:05:18] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8418411 keys, up 5 minutes 12 seconds - replication_delay is 0
[01:05:18] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8520588 keys, up 5 minutes 13 seconds - replication_delay is 0
[01:05:38] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3815296 keys, up 5 minutes 32 seconds - replication_delay is 0
[01:06:08] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3819551 keys, up 6 minutes 1 seconds - replication_delay is 0
[02:06:27] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:36:21] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.5) (duration: 09m 46s)
[02:36:27] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:57] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:06:57] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:14:46] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.6) (duration: 15m 21s)
[03:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:22:02] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Nov 1 03:22:02 UTC 2017 (duration 7m 17s)
[03:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 838.48 seconds
[03:33:28] PROBLEM - HHVM rendering on mw2201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:18] RECOVERY - HHVM rendering on mw2201 is OK: HTTP OK: HTTP/1.1 200 OK - 73647 bytes in 0.357 second response time
[03:44:38] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:07:37] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:14:38] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:18:27] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 217.59 seconds
[04:26:30] Operations, Cloud-Services, Developer-Relations, LDAP: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463#3725515 (bd808)
[04:32:37] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[05:42:17] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:05:28] PROBLEM - puppet last run on es2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:07:17] RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:25:17] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:30:50] PROBLEM - mysqld processes on labsdb1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[06:35:28] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:51:00] RECOVERY - mysqld processes on labsdb1001 is OK: PROCS OK: 1 process with command name mysqld
[06:55:00] PROBLEM - mysqld processes on labsdb1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[06:55:18] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:00] RECOVERY - mysqld processes on labsdb1001 is OK: PROCS OK: 1 process with command name mysqld
[06:58:00] Operations, cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3725553 (Marostegui)
[06:59:04] Operations, cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3725567 (Marostegui) I am trying to start MySQL but it failing on storage so I think this server is no longer available: ``` 171101 6:57:13 [Note] InnoDB: Starting an apply batch of l...
[07:00:08] Operations, DBA, cloud-services-team, Patch-For-Review, Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3725569 (Marostegui) Please check: T179464 labsdb1001 has crashed and the storage looks totally broken, hard to say if it is...
[07:00:25] (PS1) Madhuvishy: Revert "Revert "labsdb: Switchover dns for labsdb1001 shards to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464)
[07:01:00] (PS2) Madhuvishy: Revert "Revert "labsdb: Switchover dns for labsdb1001 shards to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464)
[07:01:02] (CR) Marostegui: [C: 1] Revert "Revert "labsdb: Switchover dns for labsdb1001 shards to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464) (owner: Madhuvishy)
[07:01:03] PROBLEM - mysqld processes on labsdb1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[07:01:09] (CR) jerkins-bot: [V: -1] Revert "Revert "labsdb: Switchover dns for labsdb1001 shards to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464) (owner: Madhuvishy)
[07:01:31] going to downtime labsdb1001
[07:02:23] (PS3) Madhuvishy: Revert "Revert "labsdb: Switch dns for labsdb1001 shards to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464)
[07:02:53] (CR) jerkins-bot: [V: -1] Revert "Revert "labsdb: Switch dns for labsdb1001 shards to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464) (owner: Madhuvishy)
[07:03:19] (PS4) Madhuvishy: Revert "Revert "labsdb: Switch dns for labsdb1001 to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464)
[07:03:48] (CR) jerkins-bot: [V: -1] Revert "Revert "labsdb: Switch dns for labsdb1001 to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464) (owner: Madhuvishy)
[07:04:30] Operations, DBA, cloud-services-team, Patch-For-Review, Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3725571 (Marostegui) We should consider labsdb1001 broken for good and decommission it - we need to decide whether we want...
[07:04:41] (PS5) Madhuvishy: Revert "Revert "labsdb: Switch dns for labsdb1001 to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464)
[07:05:17] (CR) Madhuvishy: [C: 2] Revert "Revert "labsdb: Switch dns for labsdb1001 to labsdb1003"" [puppet] - https://gerrit.wikimedia.org/r/387772 (https://phabricator.wikimedia.org/T179464) (owner: Madhuvishy)
[07:07:55] Operations, Patch-For-Review, cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3725577 (Marostegui) btw, the RAID keeps saying Optimal :-) ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level...
[07:12:47] PROBLEM - HHVM rendering on mw2151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:13:38] RECOVERY - HHVM rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 73028 bytes in 0.296 second response time
[07:25:37] Operations, Patch-For-Review, cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3725595 (Marostegui) I have disabled notifications and downtimed labsdb1001
[08:17:34] (PS2) Ema: varnish child started: avoid illegal characters [puppet] - https://gerrit.wikimedia.org/r/387242
[08:17:39] (CR) Ema: [V: 2 C: 2] varnish child started: avoid illegal characters [puppet] - https://gerrit.wikimedia.org/r/387242 (owner: Ema)
[08:19:27] (PS3) Ema: puppet: fix trailing slash on file resource /usr/share/varnish/tests [puppet] - https://gerrit.wikimedia.org/r/387584 (https://phabricator.wikimedia.org/T179396) (owner: Herron)
[08:19:32] (CR) Ema: [V: 2 C: 2] puppet: fix trailing slash on file resource /usr/share/varnish/tests [puppet] - https://gerrit.wikimedia.org/r/387584 (https://phabricator.wikimedia.org/T179396) (owner: Herron)
[08:24:53] Operations, ops-ulsfo, Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3711643 (MoritzMuehlenhoff) This is currently installed with jessie, but if we setup a new box, let's use stretch from the start?
[08:29:39] Operations, DBA, cloud-services-team, Patch-For-Review, Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3725662 (MoritzMuehlenhoff) >>! In T168584#3725571, @Marostegui wrote: > We should consider labsdb1001 broken for good and d...
[08:31:31] Operations, DBA, cloud-services-team, Patch-For-Review, Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3725667 (Marostegui) We are aiming for 13th Dec to retire these two hosts: T142807 and https://wikitech.wikimedia.org/wiki/W...
[08:34:42] Operations, DBA, cloud-services-team, Patch-For-Review, Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3725668 (MoritzMuehlenhoff) Let's just keep 1003 running w/o reboot then.
[08:35:37] RECOVERY - Disk space on stat1005 is OK: DISK OK
[08:35:57] !log forced umount/mount for /mnt/hdfs on stat1005 (not working after repeated oom kill actions)
[08:36:03] apergos: --^
[08:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:37] Operations, Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T179452#3725688 (Framawiki) a:Mehrdadbot>None Hello @Mehrdadbot and welcome ! What "RESOURCE" you want to access ?
[08:49:06] elukey: thanks, saw the emails yesterday
[08:58:57] (CR) Ema: "Couple of inline comments, the rest looks good." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/386895 (owner: BBlack)
[09:23:18] Operations, Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T179452#3725752 (Mehrdadbot) yes. thanks.
[09:26:26] (PS3) DCausse: Properly check for cluster existence prior setting TTM mirrors [mediawiki-config] - https://gerrit.wikimedia.org/r/387281 (https://phabricator.wikimedia.org/T179270)
[09:28:05] (PS1) Alexandros Kosiaris: k8s::controller: Notify service on config changes [puppet] - https://gerrit.wikimedia.org/r/387775
[09:29:51] (CR) Alexandros Kosiaris: [C: 2] k8s::controller: Notify service on config changes [puppet] - https://gerrit.wikimedia.org/r/387775 (owner: Alexandros Kosiaris)
[09:33:52] (PS2) Alexandros Kosiaris: Remove $cluster_cidr from k8s::controller [puppet] - https://gerrit.wikimedia.org/r/386753
[09:33:54] (PS2) Alexandros Kosiaris: k8s::controller: support service account token signing [puppet] - https://gerrit.wikimedia.org/r/386754 (https://phabricator.wikimedia.org/T177393)
[09:33:56] (PS2) Alexandros Kosiaris: Enable k8s::controller manager ServiceAccount signing [puppet] - https://gerrit.wikimedia.org/r/386755 (https://phabricator.wikimedia.org/T177393)
[09:35:25] (CR) Alexandros Kosiaris: [C: 2] Remove $cluster_cidr from k8s::controller [puppet] - https://gerrit.wikimedia.org/r/386753 (owner: Alexandros Kosiaris)
[09:37:03] (CR) Alexandros Kosiaris: [C: 2] k8s::controller: support service account token signing [puppet] - https://gerrit.wikimedia.org/r/386754 (https://phabricator.wikimedia.org/T177393) (owner: Alexandros Kosiaris)
[09:39:51] (CR) Alexandros Kosiaris: [C: 2] Enable k8s::controller manager ServiceAccount signing [puppet] - https://gerrit.wikimedia.org/r/386755 (https://phabricator.wikimedia.org/T177393) (owner: Alexandros Kosiaris)
[09:42:51] Operations, Scap, Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3725775 (MoritzMuehlenhoff) silver will be replaced by the new labweb* hosts using stretch soon, so that should be resolved soon. Is that the only one deployment...
[09:44:33] Hi, I have a problem uploading a PDF (this book https://commons.wikimedia.org/wiki/File:Tolsto%C3%AF_-_%C5%92uvres_compl%C3%A8tes,_vol10.djvu ), it says the file is corrupted, I made it again -> same error, it is a big file (174 MB), but it's the first time I get this
[09:46:06] I am trying to upload over https://commons.wikimedia.org/wiki/File:Tolsto%C3%AF_-_%C5%92uvres_compl%C3%A8tes,_vol10.pdf which chunked upload
[09:52:29] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:55:49] PROBLEM - puppet last run on ms-fe2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:59:08] (CR) GoranSMilovanovic: "> Yeah, until WMF/WMDE has a CRAN mirror we can't install packages" [puppet] - https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: Addshore)
[10:03:00] Operations, Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T179452#3725800 (Mehrdadbot) shell access(tool forge)...
[10:03:30] (CR) Nikerabbit: [C: 1] Properly check for cluster existence prior setting TTM mirrors [mediawiki-config] - https://gerrit.wikimedia.org/r/387281 (https://phabricator.wikimedia.org/T179270) (owner: DCausse)
[10:07:58] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:17:29] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[10:20:32] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3725819 (Tobi_WMDE_SW)
[10:20:49] RECOVERY - puppet last run on ms-fe2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:22:09] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:04] that's the file: https://www.dropbox.com/s/c25hri2mfhbkmop/Tolsto%C3%AF%20-%20OC%20-%20tome%2010%20-%20GP%2C%204.pdf?dl=0
[10:35:18] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:37:58] RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:51:14] (CR) jenkins-bot: Enable Unicode section links on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/386710 (https://phabricator.wikimedia.org/T175725) (owner: MaxSem)
[10:51:16] (CR) jenkins-bot: Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - https://gerrit.wikimedia.org/r/387586 (owner: EBernhardson)
[10:51:18] (CR) jenkins-bot: Scap prep: check reference directory exists [mediawiki-config] - https://gerrit.wikimedia.org/r/387743 (owner: Thcipriani)
[10:51:20] (CR) jenkins-bot: cirrus interleave config should not be wg prefixed [mediawiki-config] - https://gerrit.wikimedia.org/r/387752 (owner: EBernhardson)
[10:52:09] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:55:04] (PS2) Muehlenhoff: Ship a dummy config since IncludeOptional isn't really optional [puppet] - https://gerrit.wikimedia.org/r/386617
[11:03:03] oh good, I just got a batch of labsdb1001 pages from hours ago >_<
[11:04:33] Operations, Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T179452#3725973 (Framawiki) Open>Invalid Hello @Mehrdadbot, all the steeps to ask for a shell account on toolforge are present on https://tools.wmflabs.org/, I let you follow these guide...
[11:05:18] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:07:16] (PS1) ArielGlenn: convert script generating lists of dumps for rsync, to use config overrides [puppet] - https://gerrit.wikimedia.org/r/387781
[11:10:26] (CR) Muehlenhoff: [C: 2] Ship a dummy config since IncludeOptional isn't really optional [puppet] - https://gerrit.wikimedia.org/r/386617 (owner: Muehlenhoff)
[11:12:18] RECOVERY - HHVM rendering on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 72972 bytes in 7.692 second response time
[11:12:19] RECOVERY - Apache HTTP on labweb1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 620 bytes in 0.072 second response time
[11:12:38] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational
[11:13:59] (PS2) ArielGlenn: convert script generating lists of dumps for rsync, to use config overrides [puppet] - https://gerrit.wikimedia.org/r/387781
[11:14:39] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:15:10] (CR) ArielGlenn: [C: 2] convert script generating lists of dumps for rsync, to use config overrides [puppet] - https://gerrit.wikimedia.org/r/387781 (owner: ArielGlenn)
[11:20:38] PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:22:24] (PS6) ArielGlenn: use separate path for public/other datasets [puppet] - https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888)
[11:30:29] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:50:38] RECOVERY - puppet last run on wdqs2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[12:00:29] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[12:07:23] !log kartik@tin Started deploy [cxserver/deploy@10651e2]: Update cxserver to 0227acb
[12:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:30] !log kartik@tin Finished deploy [cxserver/deploy@10651e2]: Update cxserver to 0227acb (duration: 03m 07s)
[12:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: Traceback (most recent call last)
[12:40:43] Operations, media-storage, User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3726148 (Aklapper) Should this task get closed as `resolved`?
[12:42:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 7 probes of 285 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[12:44:54] !log installinng libdatetime-timezone-perl stable updates on Debian
[12:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:08] !log installing libav security updates
[12:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171101T1300).
[13:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:20] o/
[13:26:58] PROBLEM - puppet last run on restbase2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:29:17] !log ppchelko@tin Started deploy [restbase/deploy@2321c4c]: Update hyperswitch dependency
[13:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:49] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[13:29:49] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[13:30:48] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[13:30:48] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[13:33:06] !log ppchelko@tin Finished deploy [restbase/deploy@2321c4c]: Update hyperswitch dependency (duration: 03m 50s)
[13:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:57] dcausse: have you managed to deploy your changes?
[13:35:10] I was at the restaurant with familly and it has taken ages ..
[13:35:13] !log ppchelko@tin Started deploy [restbase/deploy@2321c4c]: Update hyperswitch dependency. Take 2
[13:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:22] hashar: np, I can deploy if it helps
[13:35:39] (CR) Ottomata: "Depends on what data you are accessing in Hadoop. If you need to access things like webrequest logs, the user accessing the data needs to" [puppet] - https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: Addshore)
[13:36:18] dcausse: if that is for labs, you can override the settings in CommonSettings-labs.php
[13:36:28] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:37:03] hashar: yes mostly for labs but I did not want to redo everything in the -labs.php file
[13:37:14] ;D
[13:37:34] (CR) Hashar: [C: 2] Properly check for cluster existence prior setting TTM mirrors [mediawiki-config] - https://gerrit.wikimedia.org/r/387281 (https://phabricator.wikimedia.org/T179270) (owner: DCausse)
[13:37:50] fair :)
[13:38:08] (CR) Hashar: [C: 2] Enable blocking feature of abuse filter in fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: Ladsgroup)
[13:38:41] (CR) Hashar: [C: 2] Enable NewUserMessage on fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/387741 (https://phabricator.wikimedia.org/T179442) (owner: Ladsgroup)
[13:38:43] (Merged) jenkins-bot: Properly check for cluster existence prior setting TTM mirrors [mediawiki-config] - https://gerrit.wikimedia.org/r/387281 (https://phabricator.wikimedia.org/T179270) (owner: DCausse)
[13:38:54] (CR) jenkins-bot: Properly check for cluster existence prior setting TTM mirrors [mediawiki-config] - https://gerrit.wikimedia.org/r/387281 (https://phabricator.wikimedia.org/T179270) (owner: DCausse)
[13:38:59] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[13:39:34] dcausse: is that testable? I pulled it on mwdebug1001
[13:39:50] hashar: if it's possible to pull on terbium I can test
[13:40:08] I guess it is all about running "scap pull" on terbium
[13:40:15] if you wanna give it a try
[13:40:20] ok testing
[13:40:39] hashar: I'm here now
[13:40:46] it would sync terbium /srv/mediawiki with whatever has been fetched on tin.eqiad.wmnet
[13:40:52] to be honest I forgot I put stuff in SWAT
[13:41:03] !log ppchelko@tin Finished deploy [restbase/deploy@2321c4c]: Update hyperswitch dependency. Take 2 (duration: 05m 50s)
[13:41:06] sorry
[13:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:18] (PS4) Hashar: Enable blocking feature of abuse filter in fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: Ladsgroup)
[13:41:20] grblblb
[13:41:23] stupid merge conflicts
[13:41:33] (CR) Hashar: [C: 2] Enable blocking feature of abuse filter in fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: Ladsgroup)
[13:41:53] I can't test it but it looks straightforward I guess
[13:41:58] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[13:42:05] Amir1: yeah I will just sync them
[13:42:14] I am not worried about those fawikiquote patches
[13:42:19] hashar: looks good, config is unchanged in prod
[13:42:25] cool
[13:43:50] !log hashar@tin Synchronized wmf-config/CommonSettings.php: Properly check for cluster existence prior setting TTM mirrors - T179270 (duration: 01m 05s)
[13:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:57] T179270: TTMServerMessageUpdateJob fails in labs - https://phabricator.wikimedia.org/T179270
[13:44:44] hashar: thank you
[13:44:55] (Merged) jenkins-bot: Enable blocking feature of abuse filter in fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: Ladsgroup)
[13:45:36] hashar: thanks!
[13:45:53] (PS2) Hashar: Enable NewUserMessage on fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/387741 (https://phabricator.wikimedia.org/T179442) (owner: Ladsgroup)
[13:45:57] !log installing quagga security updates
[13:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:03] dcausse: and don't forget All Saints' Day :]
[13:46:14] heh :)
[13:46:19] !log hashar@tin Synchronized wmf-config/abusefilter.php: Enable blocking feature of abuse filter in fawikiquote - T178227 (duration: 00m 50s)
[13:46:23] (CR) GoranSMilovanovic: "> Depends on what data you are accessing in Hadoop. If you need to" [puppet] - https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: Addshore)
[13:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:25] T178227: Enable blocking feature of abuse filter in fawikiquote - https://phabricator.wikimedia.org/T178227
[13:46:26] (CR) jenkins-bot: Enable blocking feature of abuse filter in fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: Ladsgroup)
[13:46:40] (CR) Hashar: [C: 2] Enable NewUserMessage on fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/387741 (https://phabricator.wikimedia.org/T179442) (owner: Ladsgroup)
[13:47:15] Notice: Undefined index: enwiki in /srv/mediawiki/php-1.31.0-wmf.5/extensions/ORES/includes/Cache.php on line 52
[13:47:15] Warning: Invalid argument supplied for foreach() in /srv/mediawiki/php-1.31.0-wmf.5/extensions/ORES/includes/Cache.php on line 56
[13:47:29] Amir1: unrelated to SWAT but ORES got some notice/warning :)
[13:47:45] (Merged) jenkins-bot: Enable NewUserMessage on fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/387741 (https://phabricator.wikimedia.org/T179442) (owner: Ladsgroup)
[13:49:11] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable NewUserMessage on fawikiquote - T179442 (duration: 00m 50s)
[13:49:15] (CR) jenkins-bot: Enable NewUserMessage on fawikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/387741 (https://phabricator.wikimedia.org/T179442) (owner: Ladsgroup)
[13:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:18] T179442: Enable NewUserMessage on fawikiquote - https://phabricator.wikimedia.org/T179442
[13:49:19] Amir1: both changes deployed
[13:49:43] Operations, ops-ulsfo, Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3726257 (BBlack) >>! In T179050#3725651, @MoritzMuehlenhoff wrote: > This is currently installed with jessie, but if we setup a new box, let's use stretch from the start? +1 We may as well move to stre...
[13:51:02] hashar: thanks
[13:51:29] hashar: Regarding ORES, Adam is one it
[13:51:32] *on
[13:51:41] cool
[13:51:46] (Abandoned) Gehel: wdqs: cleanup JVM options [puppet] - https://gerrit.wikimedia.org/r/384663 (https://phabricator.wikimedia.org/T175919) (owner: Gehel)
[13:55:37] Operations, ops-ulsfo, Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3726279 (MoritzMuehlenhoff) >>! In T179050#3726257, @BBlack wrote: > +1 We may as well move to stretch here. For the bastion/installserver role it should be pretty simple? I wouldn't expect any proble...
[13:56:05] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#3726284 (10Gehel) [13:56:07] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Investigate tweaking of the "wait for me" parameter for upgrades / restarts - https://phabricator.wikimedia.org/T109091#3726281 (10Gehel) 05Open>03Resolved a:03Gehel We have tuned a bit this part. The main issue is that as soon as writes... [13:56:58] RECOVERY - puppet last run on restbase2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:59:57] 10Operations, 10Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T179452#3726299 (10Mehrdadbot) thanks. [14:01:28] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:03:33] (03PS3) 10Gehel: logstash: update logstash_syslog common hiera parameter to point to LVS. [puppet] - 10https://gerrit.wikimedia.org/r/383146 (https://phabricator.wikimedia.org/T175242) [14:09:44] !log cp*: disabling puppet to test strongswan change... [14:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:38] (03PS2) 10BBlack: strongswan: turn on fragmentation of IKE [puppet] - 10https://gerrit.wikimedia.org/r/387648 [14:11:17] (03CR) 10BBlack: [C: 032] strongswan: turn on fragmentation of IKE [puppet] - 10https://gerrit.wikimedia.org/r/387648 (owner: 10BBlack) [14:23:51] 10Operations, 10Traffic, 10Services (watching): restbase.svc.eqiad.wmnet directs requests to staging if the origin is staging too - https://phabricator.wikimedia.org/T179494#3726448 (10mobrovac) [14:24:39] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:40:02] 10Operations, 10Traffic, 10Services (watching): restbase.svc.eqiad.wmnet directs requests to staging if the origin is staging too - https://phabricator.wikimedia.org/T179494#3726448 (10mark) The staging hosts have the LVS service IP (restbase.svc.eqiad.wmnet, 10.2.2.17) bound to their loopback IP - as every... [14:41:45] !log installing poppler security updates [14:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:51] (03PS1) 10BBlack: Revert "strongswan: turn on fragmentation of IKE" [puppet] - 10https://gerrit.wikimedia.org/r/387803 [14:45:09] (03CR) 10BBlack: [V: 032 C: 032] Revert "strongswan: turn on fragmentation of IKE" [puppet] - 10https://gerrit.wikimedia.org/r/387803 (owner: 10BBlack) [14:47:16] (03PS1) 10BBlack: Remove borked cp4024 from ipsec nodelists [puppet] - 10https://gerrit.wikimedia.org/r/387805 (https://phabricator.wikimedia.org/T174891) [14:47:35] (03CR) 10BBlack: [V: 032 C: 032] Remove borked cp4024 from ipsec nodelists [puppet] - 10https://gerrit.wikimedia.org/r/387805 (https://phabricator.wikimedia.org/T174891) (owner: 10BBlack) [14:49:28] (03PS1) 10EBernhardson: Update CirrusSearch enwiki MLR model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387806 [14:50:04] can i sneak a mediawiki-config deploy in? It's a one line change to update the cirrus ranking model, there are some significant deficiencies with the one rolled out monday [14:51:38] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 112 ESP OK [14:53:06] 10Operations, 10Traffic, 10Services (watching): restbase.svc.eqiad.wmnet directs requests to staging if the origin is staging too - https://phabricator.wikimedia.org/T179494#3726448 (10BBlack) I don't think they're //currently// puppetized for lvs::realserver, but it looks like the machines had such a config... 
[14:54:18] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 54 ESP OK [14:54:38] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 54 ESP OK [14:54:39] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:54:48] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 68 ESP OK [14:54:48] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 112 ESP OK [14:54:49] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 54 ESP OK [14:54:58] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 54 ESP OK [14:55:08] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 68 ESP OK [14:55:08] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 54 ESP OK [14:55:09] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 68 ESP OK [14:55:09] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 54 ESP OK [14:55:17] !log strongswan experiment done, cp* back to puppet-agent-enabled [14:55:18] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 68 ESP OK [14:55:18] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 68 ESP OK [14:55:18] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 68 ESP OK [14:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:28] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 54 ESP OK [14:55:29] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 54 ESP OK [14:55:29] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 54 ESP OK [14:55:48] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 54 ESP OK [14:55:48] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 68 ESP OK [14:55:48] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 68 ESP OK [14:55:48] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 68 ESP OK [14:55:48] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 68 ESP OK [14:55:58] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 54 ESP OK [14:57:10] (03PS2) 10EBernhardson: Update CirrusSearch enwiki MLR model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387806 [14:59:11] 
(03CR) 10DCausse: [C: 031] Update CirrusSearch enwiki MLR model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387806 (owner: 10EBernhardson) [14:59:44] (03CR) 10EBernhardson: [C: 032] Update CirrusSearch enwiki MLR model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387806 (owner: 10EBernhardson) [15:00:59] !log restbase: removing wikimedia-lvs-realserver from staging hosts T179494 [15:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:05] T179494: restbase.svc.eqiad.wmnet directs requests to staging if the origin is staging too - https://phabricator.wikimedia.org/T179494 [15:01:45] (03Merged) 10jenkins-bot: Update CirrusSearch enwiki MLR model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387806 (owner: 10EBernhardson) [15:01:58] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 112 ESP OK [15:01:59] (03CR) 10jenkins-bot: Update CirrusSearch enwiki MLR model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387806 (owner: 10EBernhardson) [15:05:59] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 112 ESP OK [15:07:09] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 112 ESP OK [15:07:12] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Update CirrusSearch MLR model on enwiki (duration: 00m 51s) [15:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:18] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [15:07:35] looking [15:07:41] !log awight@tin Started deploy [ores/deploy@9f361d2]: revscoring 2 -> ores* (non-production) [15:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:10] looks like the mediawiki exceptions were unrelated to my push, they are mostly: [{exception_id}] {exception_url} Wikimedia\Rdbms\DBReplicationWaitError from line 372 of 
/srv/mediawiki/php-1.31.0-wmf.5/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to db1062 [15:08:50] akosiaris: This might be in your domain? I’m trying to deploy to the new ORES cluster, and getting this error from ores2003 and ores2007: > 15:07:44 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'cluster', 'fetch', '--refresh-config'] on ores2003.codfw.wmnet returned [255]: Permission denied (publickey,keyboard-interactive). [15:09:18] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [15:09:37] !log awight@tin Finished deploy [ores/deploy@9f361d2]: revscoring 2 -> ores* (non-production) (duration: 01m 57s) [15:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:32] !log awight@tin Started deploy [ores/deploy@9f361d2]: revscoring 2 -> ores1002 (non-production) [15:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:06] akosiaris: thcipriani and I think there’s something wrong with those two machines. [15:12:58] !log awight@tin Finished deploy [ores/deploy@9f361d2]: revscoring 2 -> ores1002 (non-production) (duration: 02m 25s) [15:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:03] I just don't see them in puppet manifests/site.pp, but I could be missing it [15:13:47] 10Operations, 10Traffic, 10Services (done): restbase.svc.eqiad.wmnet directs requests to staging if the origin is staging too - https://phabricator.wikimedia.org/T179494#3726709 (10mobrovac) 05Open>03Resolved a:03mobrovac Ok, after a round of `apt-get remove --purge wikimedia-lvs-realserver && ip addr... 
[15:15:40] !log repo reorg: moved ftpsync from thirdparty to main and docker-engine from thirdparty to thirdparty/k8s [15:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:38] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 112 ESP OK [15:17:21] (03PS9) 10Muehlenhoff: Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) [15:17:33] (03CR) 10Zoranzoki21: "> Per task description, "@jhsoby-WMNO will ask the (very small)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) (owner: 10Zoranzoki21) [15:17:37] (03PS3) 10Zoranzoki21: Enable the ArticlePlaceholder for Northern Sami (sewiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [15:30:10] (03CR) 10Chad: "Yes, that's my plan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384193 (https://phabricator.wikimedia.org/T104148) (owner: 10Chad) [15:30:25] (03PS3) 10Chad: Get rid of squid-file-labs in favor of new reverse-proxy-staging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384193 (https://phabricator.wikimedia.org/T104148) [15:31:00] (03PS7) 10ArielGlenn: use separate path for public/other datasets [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) [15:35:28] 10Operations, 10media-storage, 10User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3726757 (10Jcb) Yes, I think the task is resolved now the two files are gone and I haven't seen any ot... [15:36:20] awight: I am around, which seems to be the problem ? [15:37:33] akosiaris: thcipriani dug up more info than I have. tl;dr, scap can’t push to those machines. 
He says, > awight: so yesterday you had 2 errors for ores2003 and ores2007 but the thing is I don't see those in puppet anywhere https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L1976-L1988 and the deploy-service user can't ssh there from tin so there's some problem (I think) with the setup of those machines [15:38:08] Not urgent, btw. [15:41:22] awight: https://phabricator.wikimedia.org/T165170 those boxes have never been pooled into service intentionally [15:41:37] why are you trying to deploy code to them ? [15:42:08] akosiaris: aha, thanks for the info. Just out of ignorance. I’ll fix our scap to ignore those for now. [15:42:14] task is stalled btw per https://phabricator.wikimedia.org/T165170#3566244 on https://phabricator.wikimedia.org/T169246 [15:43:13] awight: actually this is information scap should not be having locally. It's best solved on the deployment server. Parsoid had the same issue and it's now fixed in a better way [15:43:18] lemme find the changes [15:43:35] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3725553 (10bd808) Announced on cloud-announce: https://lists.wikimedia.org/pipermail/cloud-announce/2017-November/000007.html [15:43:52] I’m confused. I was going to workaround the issue with https://gerrit.wikimedia.org/r/387811 [15:44:33] awight: https://gerrit.wikimedia.org/r/377966 [15:45:10] granted there is no ores dsh group yet [15:45:30] cause we are blocked on the stresstesting and haven't yet enabled that cluster [15:45:49] but that's something we should do [15:45:59] akosiaris: Interesting. I was getting advice to use environments rather than groups, but was going to hold off until we deprecate the old scb* deployments anyway. [15:46:17] But this external node list looks nice. Shall I make a task to do that? [15:46:45] I 'd say yes. 
Feel free to block it on getting the new cluster up and running though [15:46:51] it does make sense [15:49:02] 10Operations, 10ORES, 10Scoring-platform-team: Use external dsh group to list pooled ORES nodes - https://phabricator.wikimedia.org/T179501#3726795 (10awight) [15:49:21] 10Operations, 10ORES, 10Scoring-platform-team: Use external dsh group to list pooled ORES nodes - https://phabricator.wikimedia.org/T179501#3726808 (10awight) [15:50:27] (03CR) 10Jforrester: [C: 031] Get rid of squid-file-labs in favor of new reverse-proxy-staging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384193 (https://phabricator.wikimedia.org/T104148) (owner: 10Chad) [16:01:06] !log lvs1003 - puppet disabled, testing experimental ethtool ringbuffer change [16:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:45] 10Operations, 10ops-eqiad: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3726847 (10Ottomata) [16:12:09] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:12:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [16:12:18] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [16:13:06] ^ probably me, very thin spike, sorry! [16:13:22] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3726869 (10herron) [16:13:24] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppet4: Catalog failed: Catalog has broken references: varnish::wikimedia_vcl[/usr/share/varnish/tests/wikimedia-common_upload-backend.inc.vcl](/etc/puppet/modules/varnish/manifests/instance.pp:98 - https://phabricator.wikimedia.org/T179396#3726866 (1... 
[16:14:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [16:14:10] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3711053 (10herron) [16:15:47] (03PS8) 10ArielGlenn: use separate path for public/other datasets [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) [16:18:04] (03CR) 10ArielGlenn: [C: 032] use separate path for public/other datasets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) (owner: 10ArielGlenn) [16:21:18] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:21:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:21:18] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:22:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:23:32] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3726914 (10herron) Notes and observations from upgrading puppetmaster2001 via `apt-get install puppetmaster` puppet packages. 1. The p... 
[16:27:33] (03PS2) 10Gehel: use the logstash LVS endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383355 (https://phabricator.wikimedia.org/T175242) [16:37:39] (03PS1) 10Ema: VCL: add layer information to X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/387817 [16:43:40] (03PS1) 10Chad: group1 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387818 [16:43:42] (03CR) 10Chad: [C: 04-2] group1 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387818 (owner: 10Chad) [16:46:03] (03CR) 10Hashar: "check experimental" [debs/pybal] - 10https://gerrit.wikimedia.org/r/384483 (https://phabricator.wikimedia.org/T178149) (owner: 10Ema) [16:47:21] (03CR) 10jenkins-bot: 1.14.2: do not crash on empty runcommand.arguments [debs/pybal] - 10https://gerrit.wikimedia.org/r/384483 (https://phabricator.wikimedia.org/T178149) (owner: 10Ema) [16:48:20] jouncebot: next [16:48:21] In 1 hour(s) and 11 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171101T1800) [16:48:41] * no_justification thwacks jouncebot over the head [16:48:46] (03CR) 10Chad: [C: 032] Get rid of squid-file-labs in favor of new reverse-proxy-staging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384193 (https://phabricator.wikimedia.org/T104148) (owner: 10Chad) [16:49:54] (03Merged) 10jenkins-bot: Get rid of squid-file-labs in favor of new reverse-proxy-staging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384193 (https://phabricator.wikimedia.org/T104148) (owner: 10Chad) [16:50:04] (03CR) 10jenkins-bot: Get rid of squid-file-labs in favor of new reverse-proxy-staging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384193 (https://phabricator.wikimedia.org/T104148) (owner: 10Chad) [16:54:58] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team: Use external dsh group to list pooled ORES nodes - https://phabricator.wikimedia.org/T179501#3727071 (10Halfak) p:05Triage>03Low [17:00:57] (03CR) 
10DCausse: [C: 031] use the logstash LVS endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383355 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [17:01:58] (03CR) 10EBernhardson: [C: 031] "For the code, this is certainly correct. For if all the services that use it will work appropriately ... probably? Will have to monitor ro" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383355 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [17:06:19] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3725553 (10madhuvishy) Disk setup for labsdb1001 * `/dev/sda -> 3.271TB after Hardware RAID 10 (H800, External shelf, 12 Disks 558.911 GB each)` * `/dev/sd[b,c,d,e... [17:09:06] !log demon@tin Synchronized docroot/noc/: dropped squid-labs.php (duration: 00m 51s) [17:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:42] (03CR) 10Hashar: "check experimental" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/149387 (owner: 10Hashar) [17:10:21] (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/149387 (owner: 10Hashar) [17:10:40] !log demon@tin Synchronized wmf-config/: dropped squid-labs, no-op in prod (duration: 00m 52s) [17:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:38] (03CR) 10Jayprakash12345: "@Reedy, Dereckson, Hashar Can you create the SQL table for shorturl. So that I can Schedule this patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [17:17:07] 10Operations, 10Gerrit, 10Readers-Web-Backlog, 10Patch-For-Review, and 2 others: [subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3727197 (10Niedzielski) [17:26:03] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3725553 (10Superyetkin) Any chance to recover data? [17:30:04] (03PS1) 10Jforrester: Get rid of squid.php in favor of new reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387832 (https://phabricator.wikimedia.org/T104148) [17:30:27] (03CR) 10Jforrester: "Non-staging version: I3ceac441" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384193 (https://phabricator.wikimedia.org/T104148) (owner: 10Chad) [17:39:01] (03PS1) 10ArielGlenn: add new dumpsgen user to dataset1001 and ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/387834 (https://phabricator.wikimedia.org/T178893) [17:39:12] !log demon@tin Synchronized php-1.31.0-wmf.6/includes/specials/pagers/ContribsPager.php: fix missing page_is_new error (duration: 00m 51s) [17:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:17] James_F: Well, the labs one went out and the world didn't explode :p [17:44:56] (03PS2) 10ArielGlenn: add new dumpsgen user to dataset1001 and ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/387834 (https://phabricator.wikimedia.org/T178893) [17:52:02] (03CR) 10ArielGlenn: [C: 032] add new dumpsgen user to dataset1001 and ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/387834 (https://phabricator.wikimedia.org/T178893) (owner: 10ArielGlenn) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning 
SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171101T1800). [18:00:04] Jayprakash12345: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:55] yah i am ready [18:02:41] !log installing openjpeg2 security updates [18:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:44] who will swat? [18:09:40] (03CR) 10Chad: [C: 032] New logo for se.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [18:11:12] (03PS1) 10Ottomata: Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) [18:12:02] (03Merged) 10jenkins-bot: New logo for se.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [18:12:08] (03CR) 10jerkins-bot: [V: 04-1] Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [18:12:16] (03CR) 10jenkins-bot: New logo for se.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [18:13:30] !log demon@tin Synchronized static/images/project-logos/sewikimedia.png: new logo (duration: 00m 50s) [18:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:37] Jayprakash12345: Done. 
[18:14:32] (03PS2) 10Ottomata: Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) [18:14:41] yah I checked the patch in mwdebug1002. everything is fine. Thank You [18:15:05] (03CR) 10jerkins-bot: [V: 04-1] Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [18:16:12] (03PS3) 10Ottomata: Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) [18:16:41] (03CR) 10jerkins-bot: [V: 04-1] Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [18:17:54] (03PS4) 10Ottomata: Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) [18:18:27] (03CR) 10jerkins-bot: [V: 04-1] Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [18:19:15] (03PS5) 10Ottomata: Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) [18:19:20] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3727488 (10Legoktm) >>! In T177891#3719471, @Addshore wrote: > So, I just tried deploying the above change. > While testing on mwdebug1... 
[18:20:04] (03CR) 10jerkins-bot: [V: 04-1] Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [18:21:53] (03CR) 10Zoranzoki21: [C: 031] Get rid of squid.php in favor of new reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387832 (https://phabricator.wikimedia.org/T104148) (owner: 10Jforrester) [18:22:09] Chad: Can you Synchronized static/images/project-logos/sewikimedia.png: new logo with task Number like TXXXX again? Because The change is disappearing after the off mwdebug1002. [18:22:34] (03CR) 10Chad: [C: 032] Get rid of squid.php in favor of new reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387832 (https://phabricator.wikimedia.org/T104148) (owner: 10Jforrester) [18:23:43] Disappearing after the off? [18:23:45] I don't understand [18:25:48] (03Merged) 10jenkins-bot: Get rid of squid.php in favor of new reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387832 (https://phabricator.wikimedia.org/T104148) (owner: 10Jforrester) [18:26:18] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [18:26:29] (03CR) 10jenkins-bot: Get rid of squid.php in favor of new reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387832 (https://phabricator.wikimedia.org/T104148) (owner: 10Jforrester) [18:28:02] When I switched off mwdebug1002 then old logo came again. And I saw that you Synchronized static/images/project-logos/sewikimedia.png: new logo. So why new logo not came? [18:29:15] probably cached :) [18:29:48] no_justification: Can you Synchronized static/images/project-logos/sewikimedia.png: new logo with task Number like TXXXX again? [18:29:48] what's public the URL to see the logo? 
[18:30:01] I don't think he needs to sync it again [18:31:11] please go https://se.wikimedia.org and tell me what is the color of logo. [18:31:25] old logo are in green [18:32:08] Yep, looks cached [18:32:11] Lemme force a purge [18:32:11] what's the public URL of the logo file itself? [18:32:33] https://se.wikimedia.org/static/images/project-logos/sewikimedia.png [18:33:16] Cache-busting it with like ?poop works and gives me the new one [18:33:18] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [18:33:26] Jayprakash12345: It's live, just gonna take a bit for caches to all clear out [18:33:35] (mwdebug looks right because it skips cache layer) [18:33:51] I can purge it, or the script can I guess [18:34:02] but static images all have to have the hostname rewritten for purging to work right [18:34:48] no_justification: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Image_Cache_Purges [18:34:56] eg: echo "https://en.wikipedia.org/static/images/project-logos/newikibooks.png" | mwscript purgeList.php --wiki=enwiki [18:35:23] !log demon@tin Synchronized docroot/: dropping squid.php (duration: 00m 52s) [18:35:27] just replace newikibooks.png with the logo name [18:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:29] * hasharDinner eats more [18:36:59] !log demon@tin Synchronized wmf-config/CommonSettings.php: use reverse-proxy.php no more squid.php (duration: 00m 50s) [18:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:19] hasharDinner: I know :) [18:38:02] bblack: It's also just a logo update -- if it takes a bit to start showing then we'll live :) [18:39:33] !log demon@tin Synchronized wmf-config/: Dropping squid.php (hang on to your pants folks, this could be fun) (duration: 00m 51s) [18:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:14] getting rid 
of "squid" naming? [18:40:23] nice [18:41:53] (03PS1) 10BBlack: Revert "Global: runtime disable ethernet flow on fresh install" [puppet] - 10https://gerrit.wikimedia.org/r/387844 [18:42:07] (03PS2) 10BBlack: Revert "Global: runtime disable ethernet flow on fresh install" [puppet] - 10https://gerrit.wikimedia.org/r/387844 [18:42:21] (03CR) 10BBlack: [V: 032 C: 032] Revert "Global: runtime disable ethernet flow on fresh install" [puppet] - 10https://gerrit.wikimedia.org/r/387844 (owner: 10BBlack) [18:43:12] no_justification: anything else for me [18:43:19] No [18:43:44] bblack: Yeah, there was some bikeshedding over the name, went with squid.php -> reverse-proxy.php [18:44:21] of course reverse-proxy.php still contains $wgSquidServersNoPurge :) [18:44:41] no_justification: thank you Can you tell me how much time will take to new logo live? [18:44:53] It is live :) [18:44:57] Just cached ;-) [18:45:03] bblack: Well, blame MediaWiki :p [18:46:36] (03PS1) 10ArielGlenn: make dumps snapshot host roles more role-like [puppet] - 10https://gerrit.wikimedia.org/r/387846 (https://phabricator.wikimedia.org/T175528) [18:47:09] (03CR) 10jerkins-bot: [V: 04-1] make dumps snapshot host roles more role-like [puppet] - 10https://gerrit.wikimedia.org/r/387846 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [18:48:03] yeah we knew that. 
now I get to see how much I have to really fix [18:51:50] (03PS2) 10ArielGlenn: make dumps snapshot host roles more role-like [puppet] - 10https://gerrit.wikimedia.org/r/387846 (https://phabricator.wikimedia.org/T175528) [18:59:08] (03PS6) 10Ottomata: Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) [18:59:39] (03CR) 10jerkins-bot: [V: 04-1] Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [19:00:04] no_justification: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171101T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:05:06] !log otto@tin Started deploy [analytics/refinery@6d11d67]: Deploying refinery-source artifacts for 0.0.54 for JsonRefine job, T162610 [19:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:12] T162610: Implement EventLogging Hive refinement - https://phabricator.wikimedia.org/T162610 [19:05:37] (03PS3) 10ArielGlenn: make dumps snapshot host roles more role-like [puppet] - 10https://gerrit.wikimedia.org/r/387846 (https://phabricator.wikimedia.org/T175528) [19:05:49] bblack: See https://phabricator.wikimedia.org/T104148#3727257 :-) [19:08:51] !log otto@tin Finished deploy [analytics/refinery@6d11d67]: Deploying refinery-source artifacts for 0.0.54 for JsonRefine job, T162610 (duration: 03m 45s) [19:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:05] (03PS1) 10BBlack: Revert "Global: Turn off ethernet flow for all interfaces at boot time" [puppet] - 10https://gerrit.wikimedia.org/r/387849 [19:09:12] (03PS2) 10BBlack: Revert "Global: Turn off ethernet flow for all interfaces at boot 
time" [puppet] - 10https://gerrit.wikimedia.org/r/387849 [19:09:16] (03CR) 10BBlack: [V: 032 C: 032] Revert "Global: Turn off ethernet flow for all interfaces at boot time" [puppet] - 10https://gerrit.wikimedia.org/r/387849 (owner: 10BBlack) [19:12:30] no_justification: I think I have a fix for T179430, how urgent is deployment? I’m happy to ask for a window now, or it can wait 4hr until the next SWAT. [19:12:30] T179430: ORES extension failing to parse scoring response - https://phabricator.wikimedia.org/T179430 [19:12:36] (03PS1) 10BBlack: lvs1001-6: increase bnx2 rx ring buffer [puppet] - 10https://gerrit.wikimedia.org/r/387850 [19:12:51] (03CR) 10Ottomata: [V: 032 C: 032] "Let's try it! :o" [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [19:12:59] (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8589/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [19:13:01] AFAIK, it’s the logspam plus API?oresscores requests are failing. Which I think are 3rd-party tools for now. [19:13:01] (03PS7) 10Ottomata: Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) [19:13:03] (03CR) 10Ottomata: [V: 032 C: 032] Refine Eventlogging analytics and eventbus data into Hive tables [puppet] - 10https://gerrit.wikimedia.org/r/387838 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [19:14:30] !log all hosts: manual cumin+sed removal of ethernet autoneg params from /e/n/i to match https://gerrit.wikimedia.org/r/#/c/387849/ [19:14:33] (no_justification: oops, moving discussion to -releng) [19:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:54] (03CR) 10Bearloga: "> @Bearloga Got it. 
@Addshore: We will do the same for our analytics-wmde user." [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [19:20:13] (03PS4) 10ArielGlenn: make dumps snapshot host roles more role-like [puppet] - 10https://gerrit.wikimedia.org/r/387846 (https://phabricator.wikimedia.org/T175528) [19:21:27] (03CR) 10jerkins-bot: [V: 04-1] make dumps snapshot host roles more role-like [puppet] - 10https://gerrit.wikimedia.org/r/387846 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [19:22:05] (03PS1) 10Ottomata: Fix refinery_job_jar var [puppet] - 10https://gerrit.wikimedia.org/r/387853 [19:23:22] (03PS5) 10ArielGlenn: make dumps snapshot host roles more role-like [puppet] - 10https://gerrit.wikimedia.org/r/387846 (https://phabricator.wikimedia.org/T175528) [19:23:53] (03CR) 10Ottomata: [C: 032] Fix refinery_job_jar var [puppet] - 10https://gerrit.wikimedia.org/r/387853 (owner: 10Ottomata) [19:26:52] (03PS1) 10Ottomata: Fix (again) refinery_job_jar var [puppet] - 10https://gerrit.wikimedia.org/r/387854 [19:28:06] (03PS2) 10Ottomata: Fix (again) refinery_job_jar var and run refines at 20 and 30 mins [puppet] - 10https://gerrit.wikimedia.org/r/387854 [19:28:30] !log awight@tin Synchronized php-1.31.0-wmf.6/extensions/ORES: Fix for API=oresscores, T179430 (duration: 00m 52s) [19:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:37] T179430: ORES extension failing to parse scoring response - https://phabricator.wikimedia.org/T179430 [19:28:57] (03CR) 10Ottomata: [C: 032] Fix (again) refinery_job_jar var and run refines at 20 and 30 mins [puppet] - 10https://gerrit.wikimedia.org/r/387854 (owner: 10Ottomata) [19:31:16] (03PS1) 10Ottomata: opt name is --database, not --output-database [puppet] - 10https://gerrit.wikimedia.org/r/387855 [19:31:29] (03CR) 10Ottomata: [V: 032 C: 032] opt name is --database, not --output-database [puppet] - 
10https://gerrit.wikimedia.org/r/387855 (owner: 10Ottomata) [19:31:56] !log awight@tin Synchronized php-1.31.0-wmf.5/extensions/ORES: Fix for API=oresscores, T179430 (duration: 00m 50s) [19:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:49] (03PS1) 10Ottomata: Include job_name when checking if json refine job is running [puppet] - 10https://gerrit.wikimedia.org/r/387857 [19:40:10] (03CR) 10Ottomata: [C: 032] Include job_name when checking if json refine job is running [puppet] - 10https://gerrit.wikimedia.org/r/387857 (owner: 10Ottomata) [19:41:35] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.4 [keeping static files] (duration: 01m 52s) [19:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:42] no_justification: Once the train is done, I'd like to roll out demon as soon as possible. [19:42:50] err, copypaste fail. I mean https://gerrit.wikimedia.org/r/#/c/387856/ [19:43:02] Unbreak IE10 JS [19:43:50] Krinkle: Go ahead now if you want, I haven't done my wikiversions.json bump yet, and I wanna eat first anyway [19:43:59] Got it. Thanks! [19:53:05] !log demon@tin Synchronized php-1.31.0-wmf.5/extensions/CentralNotice: fix weird git rebasing issue (duration: 00m 53s) [19:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:41] (03CR) 10Chad: [C: 032] group1 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387818 (owner: 10Chad) [19:56:08] PROBLEM - puppet last run on thumbor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:56:30] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3727766 (10Addshore) @Legoktm Just make a change that moves a paragraph. Here is one of my tests on testwiki not working while I was te... 
[19:57:10] !log krinkle@tin Synchronized php-1.31.0-wmf.6/resources/src/startup.js: T178943 (duration: 00m 51s) [19:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:17] T178943: startUp() callback sometimes happen before 'mw' is defined in IE10 - https://phabricator.wikimedia.org/T178943 [19:59:32] (03Merged) 10jenkins-bot: group1 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387818 (owner: 10Chad) [19:59:41] (03CR) 10jenkins-bot: group1 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387818 (owner: 10Chad) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Parsoid / OCG / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171101T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:30] no parsoid deploy today [20:03:37] (03CR) 10Zoranzoki21: "> Removed reviewer Zoranzoki21 with the following votes:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387832 (https://phabricator.wikimedia.org/T104148) (owner: 10Jforrester) [20:04:55] (03CR) 10Zoranzoki21: [C: 031] labs: use new redis servers for locks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387570 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi) [20:05:35] !log demon@tin Synchronized php: symlink swap (duration: 00m 49s) [20:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:24] (03PS6) 10ArielGlenn: make dumps snapshot host roles more role-like [puppet] - 10https://gerrit.wikimedia.org/r/387846 (https://phabricator.wikimedia.org/T175528) [20:07:53] (03CR) 10ArielGlenn: [C: 032] make dumps snapshot host roles more role-like [puppet] - 10https://gerrit.wikimedia.org/r/387846 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [20:09:28] (03PS2) 10BBlack: lvs1001-6: 
increase bnx2 rx ring buffer [puppet] - 10https://gerrit.wikimedia.org/r/387850 [20:10:19] (03CR) 10BBlack: [C: 032] lvs1001-6: increase bnx2 rx ring buffer [puppet] - 10https://gerrit.wikimedia.org/r/387850 (owner: 10BBlack) [20:10:26] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.6 [20:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:46] (03PS1) 10Ottomata: Run JsonRefine job in yarn deploy mode cluster and provide hive-site.xml [puppet] - 10https://gerrit.wikimedia.org/r/387860 (https://phabricator.wikimedia.org/T162610) [20:11:04] (03PS2) 10Ottomata: Run JsonRefine job in yarn deploy mode cluster and provide hive-site.xml [puppet] - 10https://gerrit.wikimedia.org/r/387860 (https://phabricator.wikimedia.org/T162610) [20:11:51] (03CR) 10jerkins-bot: [V: 04-1] Run JsonRefine job in yarn deploy mode cluster and provide hive-site.xml [puppet] - 10https://gerrit.wikimedia.org/r/387860 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [20:12:47] (03PS3) 10Ottomata: Run JsonRefine job in yarn deploy mode cluster and provide hive-site.xml [puppet] - 10https://gerrit.wikimedia.org/r/387860 (https://phabricator.wikimedia.org/T162610) [20:13:00] (03PS4) 10Ottomata: Run JsonRefine job in yarn deploy mode cluster and provide hive-site.xml [puppet] - 10https://gerrit.wikimedia.org/r/387860 (https://phabricator.wikimedia.org/T162610) [20:13:07] (03CR) 10Ottomata: [V: 032 C: 032] Run JsonRefine job in yarn deploy mode cluster and provide hive-site.xml [puppet] - 10https://gerrit.wikimedia.org/r/387860 (https://phabricator.wikimedia.org/T162610) (owner: 10Ottomata) [20:16:16] (03PS1) 10Ayounsi: Netbox scap3 initial commit [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/387861 [20:18:36] (03PS1) 10BBlack: LVS+Caches: disable Ethernet flowcontrol [puppet] - 10https://gerrit.wikimedia.org/r/387863 [20:18:38] (03PS1) 10BBlack: interface::noflow - runtime disable on fresh 
install [puppet] - 10https://gerrit.wikimedia.org/r/387864 [20:18:40] (03PS1) 10ArielGlenn: fix role name in snapshot motds [puppet] - 10https://gerrit.wikimedia.org/r/387865 [20:19:39] (03CR) 10ArielGlenn: [C: 032] fix role name in snapshot motds [puppet] - 10https://gerrit.wikimedia.org/r/387865 (owner: 10ArielGlenn) [20:20:35] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3727841 (10bd808) >>! In T179464#3727235, @Superyetkin wrote: > Any chance to recover data? We are working right now to see how many bad blocks/sectors there are o... [20:21:08] RECOVERY - puppet last run on thumbor2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:28:03] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3727851 (10Jgreen) [20:28:06] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: connect second interface for each frack to opposite switch for each eqiad host - https://phabricator.wikimedia.org/T176975#3727849 (10Jgreen) 05Open>03Resolved a:03Jgreen [20:28:16] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3727852 (10Tobi_WMDE_SW) Wondering why it worked on beta, it should have been broken since https://gerrit.wikimedia.org/r/#/c/377804/ I... 
[20:31:22] (03PS1) 10Ottomata: Camus imports eventbus data into /wmf/data/raw/event [puppet] - 10https://gerrit.wikimedia.org/r/387869 [20:34:51] (03CR) 10Ottomata: [C: 032] Camus imports eventbus data into /wmf/data/raw/event [puppet] - 10https://gerrit.wikimedia.org/r/387869 (owner: 10Ottomata) [20:34:58] (03PS2) 10Ottomata: Camus imports eventbus data into /wmf/data/raw/event [puppet] - 10https://gerrit.wikimedia.org/r/387869 [20:35:00] (03CR) 10Ottomata: [V: 032 C: 032] Camus imports eventbus data into /wmf/data/raw/event [puppet] - 10https://gerrit.wikimedia.org/r/387869 (owner: 10Ottomata) [20:43:51] (03CR) 10Chad: "Because you're constantly +1ing my changes without any context or idea if they're even ok. You +1'd a change I explicitly -2'd last week. " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387832 (https://phabricator.wikimedia.org/T104148) (owner: 10Jforrester) [20:43:59] (03PS1) 10ArielGlenn: add dumpsgen user to the snapshots hosts [puppet] - 10https://gerrit.wikimedia.org/r/387875 [20:50:46] (03CR) 10ArielGlenn: [C: 032] add dumpsgen user to the snapshots hosts [puppet] - 10https://gerrit.wikimedia.org/r/387875 (owner: 10ArielGlenn) [20:53:03] (03PS2) 10BBlack: LVS+Caches: disable Ethernet flowcontrol [puppet] - 10https://gerrit.wikimedia.org/r/387863 [20:53:50] (03CR) 10BBlack: [C: 032] LVS+Caches: disable Ethernet flowcontrol [puppet] - 10https://gerrit.wikimedia.org/r/387863 (owner: 10BBlack) [20:57:35] (03PS1) 10Ayounsi: Add fake keys for Netbox deployment [labs/private] - 10https://gerrit.wikimedia.org/r/387878 [20:58:09] no_justification: Your thoughts on https://gerrit.wikimedia.org/r/#/c/387877/ (death to $wg…Squid… variables) would be appreciated. 
[21:01:21] (03PS1) 10Ayounsi: Netbox: initial puppet commit [puppet] - 10https://gerrit.wikimedia.org/r/387880 [21:01:58] (03CR) 10jerkins-bot: [V: 04-1] Netbox: initial puppet commit [puppet] - 10https://gerrit.wikimedia.org/r/387880 (owner: 10Ayounsi) [21:07:39] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:08:47] James_F: tldr right now: I don't think it's worth the dang effort [21:08:56] Back-compat shims that will sit around /forever/ [21:10:02] wmf-config, fine whatever it's easy to fix our stu [21:10:04] *stuff [21:12:58] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [21:15:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [21:18:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [21:23:54] plus Squid is a cute name, and reminds of our infrastructural legacy :) [21:24:38] (03PS2) 10BBlack: interface::noflow - runtime disable on fresh install [puppet] - 10https://gerrit.wikimedia.org/r/387864 [21:24:44] (03CR) 10BBlack: [C: 032] interface::noflow - runtime disable on fresh install [puppet] - 10https://gerrit.wikimedia.org/r/387864 (owner: 10BBlack) [21:24:59] PROBLEM - Long running screen/tmux on mwlog1001 is CRITICAL: CRIT: Long running SCREEN process. (PID: 56811, 1729810s 1728000s). 
[21:27:50] bblack: +1 [21:29:08] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [21:29:37] (03CR) 10BBlack: "Only thing I'm really worried about here, is we do send X-Cache-Status to analytics webrequest stream in modules/profile/manifests/cache/k" [puppet] - 10https://gerrit.wikimedia.org/r/387817 (owner: 10Ema) [21:32:07] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695340 (10Krinkle) Continuing from T178538#3699577, the Last Call period for this RFC has expired and TechCom has decided to cancel it's proposed "Approval" based on t... [21:34:29] (03PS1) 10ArielGlenn: add dumpsgen to sudo rules for the appropriate admin groups [puppet] - 10https://gerrit.wikimedia.org/r/387917 [21:37:39] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:42:46] (03PS2) 10Ayounsi: Netbox: initial puppet commit [puppet] - 10https://gerrit.wikimedia.org/r/387880 [21:43:53] (03CR) 10jerkins-bot: [V: 04-1] Netbox: initial puppet commit [puppet] - 10https://gerrit.wikimedia.org/r/387880 (owner: 10Ayounsi) [21:46:58] (03CR) 10ArielGlenn: [C: 032] add dumpsgen to sudo rules for the appropriate admin groups [puppet] - 10https://gerrit.wikimedia.org/r/387917 (owner: 10ArielGlenn) [21:57:01] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3728033 (10Addshore) >>! In T177891#3727852, @Tobi_WMDE_SW wrote: > Wondering why it worked on beta, it should have been broken since h... 
[22:01:53] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3728041 (10Smalyshev) Does this proposal mean we'd have to migrate all PHP 5.x services to hhvm, with knowledge that we'll have to migrate them to PHP 7 at some later p... [22:08:43] (03PS1) 10Legoktm: Disable REL1_28 in ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387936 [22:09:57] no_justification: ^ [22:10:10] (03CR) 10Chad: [C: 032] Disable REL1_28 in ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387936 (owner: 10Legoktm) [22:11:44] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3721967 (10greg) From moritz: P6242 (machines still running trusty) [22:13:42] (03Merged) 10jenkins-bot: Disable REL1_28 in ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387936 (owner: 10Legoktm) [22:17:31] !log demon@tin Synchronized wmf-config/CommonSettings.php: rel1_28 is dead (duration: 00m 50s) [22:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:04] (03PS3) 10Ayounsi: Netbox: initial puppet commit [puppet] - 10https://gerrit.wikimedia.org/r/387880 [22:18:23] legoktm: Done ^ [22:19:06] (03CR) 10jerkins-bot: [V: 04-1] Netbox: initial puppet commit [puppet] - 10https://gerrit.wikimedia.org/r/387880 (owner: 10Ayounsi) [22:19:08] (03CR) 10jenkins-bot: Disable REL1_28 in ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387936 (owner: 10Legoktm) [22:19:23] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3728097 (10daniel) >>! In T178538#3728041, @Smalyshev wrote: > Does this proposal mean we'd have to migrate all PHP 5.x services to hhvm, with knowledge that we'll have... 
[22:23:24] (03CR) 10Hashar: [C: 04-1] "00:00:20.214 modules/netbox/templates/ldap_config.py.erb:23:# heirarchy." [puppet] - 10https://gerrit.wikimedia.org/r/387880 (owner: 10Ayounsi) [22:23:36] XioNoX: ^^ [22:23:42] (03PS4) 10Ayounsi: Netbox: initial puppet commit [puppet] - 10https://gerrit.wikimedia.org/r/387880 [22:24:10] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3728124 (10Krinkle) >>! In T178538#3728041, @Smalyshev wrote: > Does this proposal mean we'd have to migrate all PHP 5.x services to hhvm, with knowledge that we'll hav... [22:24:16] hashar: what's up? [22:24:17] (03CR) 10jerkins-bot: [V: 04-1] Netbox: initial puppet commit [puppet] - 10https://gerrit.wikimedia.org/r/387880 (owner: 10Ayounsi) [22:24:19] XioNoX: and in theory you could add the test command in a git hook locally :] [22:25:08] XioNoX: I gave you some hint on https://gerrit.wikimedia.org/r/#/c/387880/ to reproduce the tests locally :] should be faster than sending to gerrit / waiting for CI [22:25:31] bundle install && bundle exec rake --jobs 1 test :D [22:25:49] thx! [22:25:52] I'll do that [22:27:01] /usr/bin/ruby2.3: No such file or directory -- /usr/share/rubygems-integration/all/gems/rake-12.0.0/exe/rake (LoadError) [22:27:15] Gem::Ext::BuildError: ERROR: Failed to build gem native extension. [22:28:01] (03PS1) 10ArielGlenn: mount nfs share from dumpsdata host on snapshots [puppet] - 10https://gerrit.wikimedia.org/r/387951 [22:28:20] is there something like virtualenv for ruby? [22:28:45] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3728176 (10greg) From that paste and https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/common/scap/dsh.yaml * snapshot hosts... 
[22:30:10] hashar: ^ :) [22:30:42] XioNoX: yeah bundler [22:31:03] apt-get install bundler [22:31:19] it would use gems to download/install gems somewhere in your home [22:31:26] then when you do: bundle exec FOO [22:31:39] bundler install doesn't work [22:31:42] it mangles the RUBYPATH and PATH to point to your gems in the home [22:31:44] hmm [22:31:49] try "bundle update" ? [22:31:55] or maybe it is "bundle install [22:32:04] bundle vs bundler [22:32:31] same isssue with both bundle/bundler install/update [22:32:41] :( [22:33:00] https://www.irccloud.com/pastebin/QX3NAsh0/ [22:33:28] ERROR: Failed to build gem native extension. [22:33:35] there is one of the extensions that requires some compilation [22:33:41] !log group2(all-wikidata) wikis to wmf.5 from 24 hours ago seems to have caused a 60% drop in navigation timing metric report count (100/min => 40/min) [22:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:48] Need to go, but will investigate when I return [22:33:57] https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-browser?var-metric=mediaWikiLoadComplete&panelId=6&fullscreen&orgId=1&from=1509412097543&to=1509511254065&refresh=5m [22:34:11] https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&panelId=12&fullscreen&orgId=1&from=now-2d&to=now&var-metric=mediaWikiLoadComplete [22:34:54] XioNoX: /usr/bin/ruby2.3: No such file or directory -- /usr/share/rubygems-integration/all/gems/rake-12.0.0/exe/rake (LoadError) . 
I am 100% sure I had the issue before [22:35:08] XioNoX hi, try apt-get install ruby-dev [22:35:12] https://stackoverflow.com/questions/22544754/failed-to-build-gem-native-extension-installing-compass [22:36:42] (03CR) 10ArielGlenn: [C: 032] mount nfs share from dumpsdata host on snapshots [puppet] - 10https://gerrit.wikimedia.org/r/387951 (owner: 10ArielGlenn) [22:37:46] and [22:37:47] gem install rake [22:38:07] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3728207 (10Tobi_WMDE_SW) @Addshore right! as long as it is > 0.3.0 it should work. [22:39:22] XioNoX: cant find any note sorry :( [22:40:53] (03PS6) 10Paladox: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) [22:41:26] (03PS7) 10Paladox: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) [22:44:45] (03PS1) 10ArielGlenn: remove hiera keys for snapshots that we no longer need [puppet] - 10https://gerrit.wikimedia.org/r/387955 [22:50:04] (03CR) 10ArielGlenn: [C: 032] remove hiera keys for snapshots that we no longer need [puppet] - 10https://gerrit.wikimedia.org/r/387955 (owner: 10ArielGlenn) [22:55:48] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171101T2300). [23:00:04] Smalyshev and legoktm: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:19] hi [23:00:24] here [23:02:36] i added stuff as well that doesn't seem to have made it into jouncebot [23:03:00] jouncebot: reload [23:03:06] jouncebot: refresh [23:03:10] I refreshed my knowledge about deployments. [23:03:21] I can SWAT [23:03:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386554 (https://phabricator.wikimedia.org/T148411) (owner: 10Smalyshev) [23:05:23] (03Merged) 10jenkins-bot: Revert "Revert "Add negative weight to disambig entities"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386554 (https://phabricator.wikimedia.org/T148411) (owner: 10Smalyshev) [23:06:24] SMalyshev: Revert "Revert "Add negative weight to disambig entities"" is on mwdebug1002, if there's anything to check there [23:06:33] (03CR) 10jenkins-bot: Revert "Revert "Add negative weight to disambig entities"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386554 (https://phabricator.wikimedia.org/T148411) (owner: 10Smalyshev) [23:06:39] thcipriani: checking [23:07:35] thcipriani: yep, seems to be working fine! [23:07:43] SMalyshev: ok, going live [23:09:42] !log thcipriani@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:386554|Revert "Revert "Add negative weight to disambig entities""]] T148411 (duration: 00m 51s) [23:09:46] ^ SMalyshev live now [23:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:50] T148411: Item search for statements ranks disambiguation items too highly - https://phabricator.wikimedia.org/T148411 [23:10:04] thcipriani: thanks, it's working! 
[23:10:16] awesome :) [23:11:59] thcipriani: for the second one, it's Wikidata, so the wikidata extension patch is the one that does the work, the other one is to keep wikibase repo in sync [23:12:13] (that's what Amir1 told me to do :) [23:12:41] okie doke, makes sense [23:18:16] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3728323 (10madhuvishy) It looks like it may be time to say goodbye to this server. I've spent some time today looking at the state of the storage configuration, and... [23:18:20] SMalyshev: Wikidata extension update is live on mwdebug1002, check please [23:19:13] checking [23:19:59] thcipriani: yep, works [23:20:04] ok, going live [23:23:26] !log thcipriani@tin Synchronized php-1.31.0-wmf.6/extensions/Wikidata/extensions/Wikibase/repo/Wikibase.hooks.php: SWAT: [[gerrit:387749|Allow turning Cirrus usage off from query]] T179428 (duration: 00m 51s) [23:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:33] T179428: Can not enable old SQL prefix search mode on wikidata - https://phabricator.wikimedia.org/T179428 [23:25:01] !log thcipriani@tin Synchronized php-1.31.0-wmf.6/extensions/Wikibase/repo/Wikibase.hooks.php: SWAT: [[gerrit:387662|Allow turning Cirrus usage off from query]] T179428 (duration: 00m 49s) [23:25:06] ^ SMalyshev all live [23:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:48] RECOVERY - puppet last run on wtp2013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:26:27] legoktm: your namespace fix is live on mwdebug1002, check please [23:26:29] thcipriani: thank you! everything seems to be fine [23:26:51] SMalyshev: yw :) glad to hear it! 
[23:27:30] thcipriani: lgtm [23:27:35] going live [23:29:47] !log thcipriani@tin Synchronized php-1.31.0-wmf.6/extensions/ParserMigration/includes/ApiParserMigration.php: SWAT: [[gerrit:387954|API: Fix WikiPage namespace]] (duration: 00m 52s) [23:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:56] ^ legoktm live everywhere [23:31:28] thanks! [23:31:31] ebernhardson: WikimediaEvents update is live for both wmf.{5,6} on mwdebug1002, check please [23:31:44] yw, thanks for the patch :) [23:33:11] thcipriani: seems reasonable enough. not awhole lot that can be tested [23:33:27] okie doke, going live wmf.6 first [23:35:42] !log thcipriani@tin Synchronized php-1.31.0-wmf.6/extensions/WikimediaEvents: SWAT: [[gerrit:387957|Turn on Cirrus AB test for DBN group sizing]] (duration: 00m 51s) [23:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:18] !log thcipriani@tin Synchronized php-1.31.0-wmf.5/extensions/WikimediaEvents: SWAT: [[gerrit:387956|Turn on Cirrus AB test for DBN group sizing]] (duration: 00m 50s) [23:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:25] ^ ebernhardson all live [23:38:06] thcipriani: thanks! keeping an eye on event counts, will take a few minutes to actually get to users [23:38:40] 10Operations, 10Deployments, 10Beta-Cluster-reproducible, 10HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#3728364 (10hashar) I am fine with https://gerrit.wikimedia.org/r/#/c/358896/ would want to schedule it a... [23:38:53] cool :) [23:39:48] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3728368 (10hashar) //Stop forcing php5 in `mwscript`// (https://gerrit.wikimedia.org/r/#/c/358896/). Well we just have to do the switch and see what happens I guess, it... 
[23:48:29] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [23:52:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [23:59:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0