[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for an Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T0000).
[00:00:04] mdholloway: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:59] * mdholloway is here
[00:03:07] I can SWAT
[00:03:19] thcipriani: cool, thanks
[00:03:27] (PS7) Paladox: ircecho: Convert script to python3 [puppet] - https://gerrit.wikimedia.org/r/463794
[00:05:25] (PS3) Bstorm: wmcs: catch and log view drop errors in maintain-views [puppet] - https://gerrit.wikimedia.org/r/479576 (https://phabricator.wikimedia.org/T211940) (owner: BryanDavis)
[00:06:38] Operations, Core Platform Team, MediaWiki-Cache, serviceops, Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (aaron) The current callers don't assume the level of durability as with mysql, just that the da...
[00:06:45] (CR) Dzahn: [C: +2] "rsyncd got installed and is listening.
user has been created and owns the doc root" [puppet] - https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) (owner: Dzahn)
[00:08:43] (CR) Dzahn: [C: +2] "[contint1001:~] $ rsync -avp ./foo/ rsync://doc1001.eqiad.wmnet/doc" [puppet] - https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) (owner: Dzahn)
[00:09:06] (CR) Dzahn: [C: +2] "rsync works, tested as above" [puppet] - https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) (owner: Dzahn)
[00:10:00] (CR) Bstorm: [C: +2] wmcs: catch and log view drop errors in maintain-views [puppet] - https://gerrit.wikimedia.org/r/479576 (https://phabricator.wikimedia.org/T211940) (owner: BryanDavis)
[00:11:18] !log contint1001 - rsyncing /srv/org/wikimedia/docs to rsync://docs1001.eqiad.wmnet/docs T211974
[00:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:21] T211974: eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974
[00:14:59] (CR) BBlack: New zone generator gen-zones.py (4 comments) [dns] - https://gerrit.wikimedia.org/r/479892 (owner: BBlack)
[00:15:06] mdholloway: change is live on mwdebug1002, check please
[00:15:09] thcipriani, mdholloway: er wait
[00:15:18] this is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Kartographer/+/480527 right?
[00:15:23] the patch won't work AFAICT
[00:15:23] (PS11) BBlack: New zone generator gen-zones.py [dns] - https://gerrit.wikimedia.org/r/479892
[00:17:20] I can revert. currently only on mwdebug1002.
[00:17:26] writing a patch
[00:17:31] legoktm: yes, that's the patch. sorry, what does it need to be doing that it isn't?
[00:17:38] legoktm: thank you
[00:18:11] ah, saw comment
[00:19:01] see [16:18:56] (PS1) Legoktm: Fix using at-ease functions in namespaced class [extensions/Kartographer] - https://gerrit.wikimedia.org/r/480678 (https://phabricator.wikimedia.org/T212218)
[00:19:04] I haven't tested it
[00:23:00] mdholloway: if you can test and review that patch I can backport it, if you want more time than the swat window to review, I can revert and can schedule for later
[00:23:42] thcipriani: let's reschedule for tomorrow.
[00:23:59] mdholloway: okie doke, reverting the first patch.
[00:24:53] legoktm: i assume ApiGraph.php currently needs fixing too, then? (see https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Graph/+/470292/)
[00:25:23] i can fix that one (unless there's something i'm missing and it's actually correct in that case)
[00:25:27] mdholloway: yes
[00:25:33] thcipriani: thank you.
[00:25:35] legoktm: thanks
[00:29:18] (CR) Dzahn: [C: +2] "i rsynced the entire data from contint1001 with rsync -avp /srv/org/wikimedia/doc/ rsync://doc1001.eqiad.wmnet/doc" [puppet] - https://gerrit.wikimedia.org/r/480573 (https://phabricator.wikimedia.org/T211974) (owner: Dzahn)
[00:32:24] (CR) Dzahn: [C: +2] "i got "bundle install" running after installing ruby-dev packages and then followed your docs." [puppet] - https://gerrit.wikimedia.org/r/480587 (owner: Hashar)
[00:32:53] (revert has been pulled to deployment server and mwdebug1002)
[00:46:46] (CR) Dzahn: "wmf-style: total violations delta 7 but i am not adding them :) "parameter .. of class 'profile::dumps::nfs' has no call to hiera" is tr" [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[01:06:37] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:15:45] PROBLEM - puppet last run on analytics1059 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[01:25:05] !log tstarling@deploy1001 Synchronized php-1.33.0-wmf.8/extensions/AbuseFilter/includes/AbuseFilter.php: g 480680 fix exception in maintenance script (duration: 00m 54s)
[01:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:39] !log tstarling@deploy1001 Synchronized php-1.33.0-wmf.8/extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php: g 480681 make maintenance script dry run more useful (duration: 00m 52s)
[01:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:45] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:44:44] (PS3) Dzahn: dumps:nfs: add data types, move variables to parameters and Hiera [puppet] - https://gerrit.wikimedia.org/r/479335
[01:45:52] (CR) jerkins-bot: [V: -1] dumps:nfs: add data types, move variables to parameters and Hiera [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[01:46:17] (PS4) Dzahn: dumps:nfs: add data types, move variables to parameters and Hiera [puppet] - https://gerrit.wikimedia.org/r/479335
[01:46:55] RECOVERY - puppet last run on analytics1059 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:47:13] (CR) jerkins-bot: [V: -1] dumps:nfs: add data types, move variables to parameters and Hiera [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[01:51:08] (PS5) Dzahn: dumps:nfs: add data types, move variables to parameters and Hiera [puppet] - https://gerrit.wikimedia.org/r/479335
[01:53:29] (PS6) Dzahn: dumps:nfs: add data types, move variables to parameters and Hiera [puppet] - https://gerrit.wikimedia.org/r/479335
[01:53:41] (CR) Dzahn: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[01:55:33] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL:
CRITICAL slave_sql_lag Replication lag: 733.47 seconds
[02:01:30] (CR) Dzahn: [C: -1] "build successful but " parameter 'clients' expects a String value, got Struct" nevertheless if you click through to the catalog" [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[02:06:08] (PS7) Dzahn: dumps:nfs: add data types, move variables to parameters and Hiera [puppet] - https://gerrit.wikimedia.org/r/479335
[02:07:54] (CR) Dzahn: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[02:13:06] (PS5) Dzahn: hadoop::ui: migrate from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/474832
[02:15:21] (CR) Dzahn: hadoop::ui: migrate from apache to httpd module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/474832 (owner: Dzahn)
[02:15:52] Operations, netops: Spike of multicast traffic - https://phabricator.wikimedia.org/T212273 (ayounsi) p:Triage→High
[02:17:03] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/13999/dumpsdata1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[02:19:38] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/14001/analytics-tool1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/474832 (owner: Dzahn)
[02:24:05] (PS6) Dzahn: hadoop::ui: migrate from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/474832
[03:37:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 961.46 seconds
[03:43:14] (PS2) Tim Starling: Un-revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480419
[03:45:06] (CR) KartikMistry: "recheck" [puppet] - https://gerrit.wikimedia.org/r/480481 (owner: KartikMistry)
[03:45:52] (CR) jerkins-bot: [V: -1] WIP: Configure cxserver ratelimiter [puppet] -
https://gerrit.wikimedia.org/r/480481 (owner: KartikMistry)
[03:46:59] RECOVERY - MariaDB Slave Lag: s3 on db1095 is OK: OK slave_sql_lag Replication lag: 0.43 seconds
[03:47:05] PROBLEM - mysqld processes on db2057 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:47:13] PROBLEM - Disk space on db2057 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[03:47:17] PROBLEM - Check systemd state on db2057 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:47:35] PROBLEM - MariaDB disk space on db2057 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[03:47:47] PROBLEM - MariaDB Slave SQL: s3 on db2057 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:48:53] PROBLEM - MariaDB Slave IO: s3 on db2057 is CRITICAL: CRITICAL slave_io_state could not connect
[03:52:12] (PS2) KartikMistry: WIP: Configure cxserver ratelimiter [puppet] - https://gerrit.wikimedia.org/r/480481
[03:53:07] (CR) jerkins-bot: [V: -1] WIP: Configure cxserver ratelimiter [puppet] - https://gerrit.wikimedia.org/r/480481 (owner: KartikMistry)
[03:53:53] (PS6) Tim Starling: Class wrapper for ProductionServices.php etc. [mediawiki-config] - https://gerrit.wikimedia.org/r/477956
[03:53:55] (PS6) Tim Starling: Put profiler hostnames in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477957
[03:53:57] (PS7) Tim Starling: Excimer and Tideways support [mediawiki-config] - https://gerrit.wikimedia.org/r/478137
[03:55:17] (PS3) Tim Starling: Un-revert "Refactor profiler.php and X-Wikimedia-Debug parsing" [mediawiki-config] - https://gerrit.wikimedia.org/r/480419
[03:55:19] (PS7) Tim Starling: Class wrapper for ProductionServices.php etc.
[mediawiki-config] - https://gerrit.wikimedia.org/r/477956
[03:55:21] (PS7) Tim Starling: Put profiler hostnames in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477957
[03:55:23] (PS8) Tim Starling: Excimer and Tideways support [mediawiki-config] - https://gerrit.wikimedia.org/r/478137
[03:55:45] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:58:28] (PS3) KartikMistry: WIP: Configure cxserver ratelimiter [puppet] - https://gerrit.wikimedia.org/r/480481
[03:59:24] (CR) jerkins-bot: [V: -1] WIP: Configure cxserver ratelimiter [puppet] - https://gerrit.wikimedia.org/r/480481 (owner: KartikMistry)
[04:27:05] (PS4) KartikMistry: WIP: Configure cxserver ratelimiter [puppet] - https://gerrit.wikimedia.org/r/480481
[04:28:00] (CR) jerkins-bot: [V: -1] WIP: Configure cxserver ratelimiter [puppet] - https://gerrit.wikimedia.org/r/480481 (owner: KartikMistry)
[04:30:47] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 291.26 seconds
[04:54:43] (PS1) KartikMistry: cxserver: Update description [puppet] - https://gerrit.wikimedia.org/r/480694
[04:59:39] (PS2) KartikMistry: cxserver: Update description [puppet] - https://gerrit.wikimedia.org/r/480694
[05:00:19] (CR) jerkins-bot: [V: -1] cxserver: Update description [puppet] - https://gerrit.wikimedia.org/r/480694 (owner: KartikMistry)
[05:01:04] (Abandoned) KartikMistry: cxserver: Update description [puppet] - https://gerrit.wikimedia.org/r/480694 (owner: KartikMistry)
[05:03:01] (PS5) KartikMistry: WIP: Configure cxserver ratelimiter [puppet] - https://gerrit.wikimedia.org/r/480481
[05:03:36] (CR) jerkins-bot: [V: -1] WIP: Configure cxserver ratelimiter [puppet] - https://gerrit.wikimedia.org/r/480481 (owner: KartikMistry)
[05:06:58] (PS1) Tim Starling: When running scripts from staging, use the CommonSettings.php
from staging [puppet] - https://gerrit.wikimedia.org/r/480695
[06:05:13] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[06:06:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[06:18:14] (PS1) Marostegui: db-codfw.php: Depool db2057 [mediawiki-config] - https://gerrit.wikimedia.org/r/480698 (https://phabricator.wikimedia.org/T212275)
[06:20:00] (PS1) Marostegui: db2057: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/480699 (https://phabricator.wikimedia.org/T212275)
[06:24:08] (CR) Marostegui: [C: +2] db2057: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/480699 (https://phabricator.wikimedia.org/T212275) (owner: Marostegui)
[06:24:34] (CR) Marostegui: [C: +2] db-codfw.php: Depool db2057 [mediawiki-config] - https://gerrit.wikimedia.org/r/480698 (https://phabricator.wikimedia.org/T212275) (owner: Marostegui)
[06:25:39] (Merged) jenkins-bot: db-codfw.php: Depool db2057 [mediawiki-config] - https://gerrit.wikimedia.org/r/480698 (https://phabricator.wikimedia.org/T212275) (owner: Marostegui)
[06:27:14] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2057 - storage crashed T212275 (duration: 01m 08s)
[06:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:17] T212275: db2057 storage crashed - https://phabricator.wikimedia.org/T212275
[06:31:52] (CR) jenkins-bot: db-codfw.php: Depool db2057 [mediawiki-config] - https://gerrit.wikimedia.org/r/480698 (https://phabricator.wikimedia.org/T212275) (owner: Marostegui)
[06:34:05] PROBLEM - puppet last run on analytics1073 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:34:15] !log Hard reboot db2057 - T212275
[06:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:18] T212275: db2057 storage crashed - https://phabricator.wikimedia.org/T212275
[06:38:35] RECOVERY - MariaDB disk space on db2057 is OK: DISK OK
[06:43:09] Operations, Core Platform Team, MediaWiki-Cache, serviceops, Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (akosiaris) >>! In T212129#4833043, @aaron wrote: > The current callers don't assume the level o...
[06:44:15] !log Remove nodepool@10.64.16.155 user from m5 master - T212230
[06:44:15] RECOVERY - Disk space on db2057 is OK: DISK OK
[06:44:17] RECOVERY - Check systemd state on db2057 is OK: OK - running: The system is fully operational
[06:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:18] T212230: [DBA] remove nodepooldb on production-m5 and nodepool user - https://phabricator.wikimedia.org/T212230
[06:45:32] (PS2) Marostegui: nodepool: cleanup database related settings [puppet] - https://gerrit.wikimedia.org/r/480661 (https://phabricator.wikimedia.org/T212230) (owner: Hashar)
[06:47:29] (CR) Marostegui: [C: +2] nodepool: cleanup database related settings [puppet] - https://gerrit.wikimedia.org/r/480661 (https://phabricator.wikimedia.org/T212230) (owner: Hashar)
[06:47:45] (PS1) Marostegui: install_server: Allow reimage db2057 [puppet] - https://gerrit.wikimedia.org/r/480700 (https://phabricator.wikimedia.org/T212275)
[06:48:25] (PS2) Marostegui: install_server: Allow reimage db2057 [puppet] - https://gerrit.wikimedia.org/r/480700 (https://phabricator.wikimedia.org/T212275)
[06:49:53] (CR) Marostegui: [C: +2] install_server: Allow reimage db2057 [puppet] - https://gerrit.wikimedia.org/r/480700
(https://phabricator.wikimedia.org/T212275) (owner: Marostegui)
[06:58:18] !log Enable GTID on s1 codfw master (db2048) - T211973
[06:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:21] T211973: Check GTID, consistency options, notifications across the fleet and db-eqiad.php weights - https://phabricator.wikimedia.org/T211973
[06:59:58] !log Enable GTID on s8 codfw master (db2045) - T211973
[07:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:01] RECOVERY - puppet last run on analytics1073 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:10:19] (PS1) Marostegui: db-codfw.php: Depool db2050 [mediawiki-config] - https://gerrit.wikimedia.org/r/480701 (https://phabricator.wikimedia.org/T212275)
[07:12:44] (CR) Marostegui: [C: +2] db-codfw.php: Depool db2050 [mediawiki-config] - https://gerrit.wikimedia.org/r/480701 (https://phabricator.wikimedia.org/T212275) (owner: Marostegui)
[07:13:31] Operations, ops-codfw, DBA: Upgrade db2057 firmware - https://phabricator.wikimedia.org/T212277 (Marostegui) p:Triage→Normal
[07:13:49] (Merged) jenkins-bot: db-codfw.php: Depool db2050 [mediawiki-config] - https://gerrit.wikimedia.org/r/480701 (https://phabricator.wikimedia.org/T212275) (owner: Marostegui)
[07:15:08] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2050 to clone db2057 T212275 (duration: 00m 52s)
[07:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:11] T212275: db2057 storage crashed - https://phabricator.wikimedia.org/T212275
[07:18:05] Operations, Core Platform Team, MediaWiki-Cache, serviceops, Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Joe) Looking at live data, we have at least one shard that's doing evictions (150k of them) and...
[07:22:17] (CR) jenkins-bot: db-codfw.php: Depool db2050 [mediawiki-config] - https://gerrit.wikimedia.org/r/480701 (https://phabricator.wikimedia.org/T212275) (owner: Marostegui)
[07:22:33] !log Stop MySQL on db2050 to clone db2057 - T212275
[07:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:36] T212275: db2057 storage crashed - https://phabricator.wikimedia.org/T212275
[07:28:53] (CR) Elukey: [WIP] Add remaining kerberos wrapped commands (1 comment) [puppet/cdh] - https://gerrit.wikimedia.org/r/480433 (owner: Elukey)
[07:34:14] (PS11) Elukey: [WIP] Add remaining kerberos wrapped commands [puppet/cdh] - https://gerrit.wikimedia.org/r/480433
[07:37:09] !log Drop nodepooldb on m5 master - T212230
[07:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:13] T212230: [DBA] remove nodepooldb on production-m5 and nodepool user - https://phabricator.wikimedia.org/T212230
[07:38:38] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, Release-Engineering-Team (Kanban), cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (Marostegui)
[07:56:31] (CR) Muehlenhoff: [C: +1] "Looks good, shall I merge or are there any pending steps to drop nodepool remaining?"
[puppet] - https://gerrit.wikimedia.org/r/480546 (https://phabricator.wikimedia.org/T209361) (owner: Hashar)
[08:16:20] !log swift eqiad-prod: more weight for ms-be10[44-50].eqiad.wmnet - T209618
[08:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:23] T209618: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618
[08:19:53] (PS12) Elukey: [WIP] Add remaining kerberos wrapped commands [puppet/cdh] - https://gerrit.wikimedia.org/r/480433
[08:27:20] Operations, TechCom, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (daniel) Using accept-language is not an option, at least not the accept-language from the browser. The relevant list of languages comes from u...
[08:27:20] !log Drop image_comment_temp from s1 - T209591
[08:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:23] T209591: Drop table image_comment_temp on all wikis - https://phabricator.wikimedia.org/T209591
[08:27:25] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14005/ - Ready for a review!" [puppet/cdh] - https://gerrit.wikimedia.org/r/480433 (owner: Elukey)
[08:28:42] !log Drop image_comment_temp from s2 - T209591
[08:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:49] Operations, TechCom, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (daniel) I agree with Joe that it would be better to have the service be internal, and be called from MW. It doesn't //have// to be that way, b...
[08:37:47] !log draining restbase2007 for eventual reboot for kernel security update
[08:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:21] !log Drop image_comment_temp from s8 - T209591
[08:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:24] T209591: Drop table image_comment_temp on all wikis - https://phabricator.wikimedia.org/T209591
[08:40:36] Operations, Release-Engineering-Team, Scap, User-ArielGlenn: Make scap and opcache work consistently together - https://phabricator.wikimedia.org/T211964 (Joe)
[08:43:44] !log rebalance row_A ganeti01.svc.codfw.wmnet nodegroup after recabling T210447
[08:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:47] T210447: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447
[08:46:41] !log Drop image_comment_temp from s6 - T209591
[08:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:44] T209591: Drop table image_comment_temp on all wikis - https://phabricator.wikimedia.org/T209591
[08:50:27] !log Drop image_comment_temp from s7 - T209591
[08:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:58] !log roll restart of cassandra on aqs1005-1009 for openjdk upgrades
[08:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:29] Operations, TechCom, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (Joe) >>! In T212189#4833482, @daniel wrote: > I agree with Joe that it would be better to have the service be internal, and be called from MW....
[08:57:25] PROBLEM - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is CRITICAL: connect to address 10.64.32.189 and port 9042: Connection refused
[08:58:37] RECOVERY - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is OK: TCP OK - 0.000 second response time on 10.64.32.189 port 9042
[09:00:15] (CR) Filippo Giunchedi: [C: +1] "LGTM, adding a couple of folks for information" [puppet] - https://gerrit.wikimedia.org/r/480664 (https://phabricator.wikimedia.org/T210486) (owner: Cwhite)
[09:03:23] (PS1) Marostegui: Revert "install_server: Allow reimage db2057" [puppet] - https://gerrit.wikimedia.org/r/480707
[09:03:38] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/480259 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[09:04:14] (CR) Marostegui: [C: +2] Revert "install_server: Allow reimage db2057" [puppet] - https://gerrit.wikimedia.org/r/480707 (owner: Marostegui)
[09:10:21] Operations, TechCom, Wikidata, Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (daniel) > Well, I consider calling the MW api from a service called by MediaWiki an antipattern that we should absolutely avoid. Oh, I got th...
[09:14:24] !log rebooting restbase2008 for kernel security update
[09:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:53] (CR) Alexandros Kosiaris: [C: -1] ircecho: Convert script to python3 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/463794 (owner: Paladox)
[09:24:47] !log draining restbase2009 for eventual reboot for kernel security update
[09:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:59] Operations, Elasticsearch, Discovery-Search (Current work): Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 (Mathew.onipe) Pull request is ready: https://github.com/justwatchcom/elasticsearch_exporter/pull/209
[09:37:00] !log dropping tables with 'T211544' prefix on db1122 - T211544
[09:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:05] T211544: Drop FlaggedRevs tables in database for ptwikipedia - https://phabricator.wikimedia.org/T211544
[09:41:15] !log draining restbase2010 for eventual reboot for kernel security update
[09:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:29] (CR) Hashar: "The server is being decommissioned and pending disk wipe ( T209642 ) :) So that can be merged at anytime."
[puppet] - https://gerrit.wikimedia.org/r/480546 (https://phabricator.wikimedia.org/T209361) (owner: Hashar)
[09:44:11] (PS2) Muehlenhoff: admin: remove CI sudo rule for "nodepool" [puppet] - https://gerrit.wikimedia.org/r/480546 (https://phabricator.wikimedia.org/T209361) (owner: Hashar)
[09:45:03] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[09:45:56] (CR) Muehlenhoff: [C: +2] admin: remove CI sudo rule for "nodepool" [puppet] - https://gerrit.wikimedia.org/r/480546 (https://phabricator.wikimedia.org/T209361) (owner: Hashar)
[09:46:02] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get)
[09:46:02] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[09:46:02] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[09:46:02] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[09:46:02] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL:
/{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[09:46:02] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[09:46:03] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[09:46:14] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[09:46:14] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[09:46:18] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[09:46:20] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016
returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test r
[09:46:20] read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200)
[09:46:22] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[09:46:22] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[09:46:22] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[09:46:26] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[09:46:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[09:46:30] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a
response was received [09:46:38] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:46:40] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:46:40] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:46:40] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:46:40] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:46:41] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=t [09:46:41] unexpected status 429 (expecting: 200) [09:46:46] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get) [09:46:49] (03PS6) 
10Alexandros Kosiaris: Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [09:46:50] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:46:50] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [09:46:54] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:46:58] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:46:58] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:46:58] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:47:37] (03PS7) 10Alexandros Kosiaris: Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [09:48:06] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] blubberoid: Bump CPU limit to 1800m [deployment-charts] - 10https://gerrit.wikimedia.org/r/480484 (owner: 10Alexandros Kosiaris) [09:48:33] (03CR) 
10jerkins-bot: [V: 04-1] Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [09:48:42] there must be some problem with AQS [09:49:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] ircecho: skip message if unable to decode it [puppet] - 10https://gerrit.wikimedia.org/r/480509 (owner: 10Volans) [09:49:26] (03PS1) 10Muehlenhoff: Remove Diamond from remaining DB roles [puppet] - 10https://gerrit.wikimedia.org/r/480710 (https://phabricator.wikimedia.org/T212231) [09:49:33] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [09:49:54] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:50:48] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [09:50:52] for some reason, aqs100[89] are seeing some instances down (via nodetool status) [09:50:52] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [09:50:52] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [09:51:00] this is after a rolling restart [09:51:00] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [09:51:00] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [09:51:01] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy [09:51:02] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [09:51:02] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [09:51:02] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints 
are healthy [09:51:04] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy [09:51:04] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [09:51:04] the other ones are fine [09:51:04] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [09:51:08] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [09:51:12] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [09:51:12] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [09:51:14] and now all good [09:51:14] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [09:51:18] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [09:51:26] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [09:51:26] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [09:51:26] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [09:51:26] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [09:51:26] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [09:51:28] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [09:51:28] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [09:51:38] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [09:51:38] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [09:51:38] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [09:51:38] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy [09:51:39] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints 
are healthy [09:51:42] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [09:51:52] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [09:51:52] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [09:53:01] (03CR) 10Hashar: "Great :) It is also run automatically by the CI job / rake test:" [puppet] - 10https://gerrit.wikimedia.org/r/480587 (owner: 10Hashar) [09:53:38] !log dropping tables 'flagged%' on db1066 ptwiki with replication enabled - T211544 [09:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:41] T211544: Drop FlaggedRevs tables in database for ptwikipedia - https://phabricator.wikimedia.org/T211544 [09:55:42] the only weirdness remains aqs1009-a [09:55:56] nodetool-a still reports a lot of instances as down [09:55:58] mmmmm [09:56:22] 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10hashar) With [[ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480573/ | Gerrit #480573 ]] doc1001.eqiad.wmn... 
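Side note on the alert storm above: a minimal sketch, assuming the alert lines keep the "[HH:MM:SS] PROBLEM/RECOVERY - <check> on <host> is <STATE>" shape seen in this log, of how each PROBLEM could be paired with its RECOVERY to estimate how long a check stayed critical. The function name `outage_durations` and the regex are illustrative assumptions, not part of any monitoring tool, and the sketch does not handle a log that crosses midnight.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape, taken from the alerts in this log:
# "[HH:MM:SS] PROBLEM - <check> on <host> is CRITICAL: ..."
# "[HH:MM:SS] RECOVERY - <check> on <host> is OK: ..."
ALERT = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] (PROBLEM|RECOVERY) - (.+?) on (\S+) is (?:CRITICAL|OK)"
)


def outage_durations(log_text):
    """Pair each PROBLEM with the next RECOVERY for the same (check, host)."""
    open_alerts = {}  # (check, host) -> time of the first unresolved PROBLEM
    durations = {}    # (check, host) -> timedelta of the last closed outage
    for stamp, state, check, host in ALERT.findall(log_text):
        when = datetime.strptime(stamp, "%H:%M:%S")
        key = (check, host)
        if state == "PROBLEM":
            # Keep the earliest PROBLEM if the alert repeats before recovering.
            open_alerts.setdefault(key, when)
        elif key in open_alerts:
            durations[key] = when - open_alerts.pop(key)
    return durations


# Two alert lines from this log, with the long check description shortened.
sample = (
    "[09:49:54] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: "
    "timed out before a response was received "
    "[09:50:48] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: "
    "All endpoints are healthy"
)

for (check, host), length in outage_durations(sample).items():
    print(f"{host}: {check} was critical for {length.total_seconds():.0f}s")
```

Keying on the (check, host) pair matters here because the same host name can appear under several checks, and the same check fires on many hosts at once.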
[09:56:48] (03CR) 10Fsero: [C: 03+2] buster package modified to customize it for WMF and for build 2.7 [debs/docker-distribution] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/475792 (https://phabricator.wikimedia.org/T210071) (owner: 10Fsero) [09:58:23] (03PS1) 10Marostegui: Revert "db2057: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/480712 [09:58:31] (03PS2) 10Marostegui: Revert "db2057: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/480712 [09:59:05] (03PS1) 10Muehlenhoff: Remove Hiera file obsoleted by nodepool removal [puppet] - 10https://gerrit.wikimedia.org/r/480713 (https://phabricator.wikimedia.org/T209642) [10:00:43] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [10:01:27] (03CR) 10Hashar: [C: 03+1] "Indeed the wmcs::openstack::main::nodepool profile is gone. Good catch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/480713 (https://phabricator.wikimedia.org/T209642) (owner: 10Muehlenhoff) [10:02:08] aqs1009-a now completed the handshake with all the other instances, it only took minutes [10:02:11] really weird [10:02:59] (03PS8) 10Alexandros Kosiaris: Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [10:04:32] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php::monitoring: fine-grained opcache invalidation [puppet] - 10https://gerrit.wikimedia.org/r/480714 (https://phabricator.wikimedia.org/T211964) [10:05:54] (03PS1) 10Hashar: doc: set cluster and notification groups [puppet] - 10https://gerrit.wikimedia.org/r/480715 (https://phabricator.wikimedia.org/T211974) [10:05:56] (03PS2) 10Muehlenhoff: Remove Hiera file obsoleted by nodepool removal [puppet] - 10https://gerrit.wikimedia.org/r/480713 (https://phabricator.wikimedia.org/T209642) [10:06:57] (03PS13) 10Elukey: [WIP] Add remaining kerberos wrapped commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 [10:07:02] (03CR) 10Muehlenhoff: [C: 03+2] Remove Hiera file obsoleted by nodepool removal [puppet] - 10https://gerrit.wikimedia.org/r/480713 (https://phabricator.wikimedia.org/T209642) (owner: 10Muehlenhoff) [10:10:20] (03CR) 10Elukey: "Moved from 'root' to 'oozie' in the oozie::server's exec, pcc still looks good https://puppet-compiler.wmflabs.org/compiler1002/14006/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [10:10:44] (03PS1) 10Hashar: doc: grant doc-uploader access to contint users [puppet] - 10https://gerrit.wikimedia.org/r/480716 (https://phabricator.wikimedia.org/T211974) [10:13:21] (03CR) 10Hashar: "cluster: ci , I think that is for monitoring purpose. Filippo would know for sure. 
Since doc is used by CI, it seems legit to add it und" [puppet] - 10https://gerrit.wikimedia.org/r/480715 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [10:14:42] !log draining restbase2011 for eventual reboot for kernel security update [10:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:33] (03CR) 10Hashar: "We can definitely use a shell access to lurk at /srv/org/wikimedia, might as well give us access to the doc-publisher unix account in case" [puppet] - 10https://gerrit.wikimedia.org/r/480716 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [10:18:02] (03CR) 10Hashar: "The ci cluster has been introduced recently by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/479845/ (thank you for that!)" [puppet] - 10https://gerrit.wikimedia.org/r/480715 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [10:25:57] !log stopping replication on db2073 as executing schema change on codfw master - T85757 [10:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:00] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:26:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, worth a review by Brooke as well" [puppet] - 10https://gerrit.wikimedia.org/r/480590 (https://phabricator.wikimedia.org/T212254) (owner: 10Ladsgroup) [10:28:21] !log executing schema change in db2051 (s4 codfw master) with replication enabled - T85757 [10:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:05] (03CR) 10Ema: [C: 03+1] remote: add more functionalities (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/480064 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [10:37:42] !log draining restbase2012 for eventual reboot for kernel security update [10:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:50] PROBLEM - toolschecker: 
tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [10:42:44] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1158 bytes in 0.065 second response time [10:43:04] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1168 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [10:44:16] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [10:47:11] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480721 [10:48:42] (03CR) 10Ema: [C: 03+1] "Minor comment, other than that +1" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [10:50:34] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) I'm not sure if we will be able to do this operation before the end of year holidays. @ayounsi Could we open a windo... 
[10:52:51] (03CR) 10Ema: [C: 03+1] sre.hosts: add upgrade and reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [10:53:33] (03CR) 10Volans: [C: 03+2] remote: add more functionalities [software/spicerack] - 10https://gerrit.wikimedia.org/r/480064 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [10:58:41] (03Merged) 10jenkins-bot: remote: add more functionalities [software/spicerack] - 10https://gerrit.wikimedia.org/r/480064 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [11:02:13] jouncebot: next [11:02:13] In 0 hour(s) and 57 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T1200) [11:03:41] (03PS2) 10Volans: interactive: check TTY in ask_confirmation() [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) [11:04:10] (03CR) 10Volans: interactive: check TTY in ask_confirmation() (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [11:09:18] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) a:03Addshore [11:11:17] !log Drop image_comment_temp from s5 - T209591 [11:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:20] T209591: Drop table image_comment_temp on all wikis - https://phabricator.wikimedia.org/T209591 [11:12:57] (03PS4) 10Volans: README: move API documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) [11:13:07] !log Stop MySQL and power off db2057 for firmware upgrade - T212277 [11:13:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:10] T212277: Upgrade db2057 firmware - https://phabricator.wikimedia.org/T212277 [11:14:06] (03PS4) 10Volans: sre.hosts: add upgrade and reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) [11:14:40] (03CR) 10Volans: sre.hosts: add upgrade and reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [11:15:11] 10Operations, 10ops-codfw, 10DBA: Upgrade db2057 firmware - https://phabricator.wikimedia.org/T212277 (10Marostegui) @papaul server is powered off, so you can proceed whenever you can. Once you are done, power it on and we will start MySQL and repool it Thanks! [11:15:21] !log depooling db1084 for schema change - T85757 [11:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:24] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:15:42] (03CR) 10Banyek: [C: 03+2] mariadb: depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479634 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [11:15:49] (03PS3) 10Banyek: mariadb: depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479634 (https://phabricator.wikimedia.org/T85757) [11:16:14] (03PS1) 10Volans: doc: add documentation and its generation [software/spicerack] - 10https://gerrit.wikimedia.org/r/480724 (https://phabricator.wikimedia.org/T205894) [11:16:49] (03PS6) 10Ema: sre.hosts: add varnish upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) [11:20:55] (03CR) 10jenkins-bot: mariadb: depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479634 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [11:22:19] (03CR) 10Ema: sre.hosts: add varnish upgrade cookbook (033 comments) [cookbooks] - 
10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [11:23:03] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480721 (owner: 10Marostegui) [11:23:32] (03CR) 10Ema: Clarify expected format of service name in wmf-auto-restart (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480520 (https://phabricator.wikimedia.org/T212219) (owner: 10Muehlenhoff) [11:24:09] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480721 (owner: 10Marostegui) [11:25:04] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1084 for schema change - T85757 (duration: 00m 52s) [11:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:07] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:25:59] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2050 after recloning db2057 T212275 (duration: 00m 52s) [11:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:02] T212275: db2057 storage crashed - https://phabricator.wikimedia.org/T212275 [11:29:03] (03CR) 10Muehlenhoff: Clarify expected format of service name in wmf-auto-restart (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480520 (https://phabricator.wikimedia.org/T212219) (owner: 10Muehlenhoff) [11:29:51] (03PS1) 10Banyek: Revert "mariadb: depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480726 [11:29:55] !log upgrading nodejs on restbase2013-2018 [11:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:12] !log repooling db1084 after schema change - T85757 [11:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:15] T85757: Dropping user.user_options on wmf databases - 
https://phabricator.wikimedia.org/T85757 [11:31:51] (03CR) 10Banyek: [C: 03+2] Revert "mariadb: depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480726 (owner: 10Banyek) [11:32:32] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480726 (owner: 10Banyek) [11:32:34] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480721 (owner: 10Marostegui) [11:32:36] (03CR) 10Mobrovac: [C: 04-1] Configure cxserver ratelimiter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [11:32:47] (03CR) 10jenkins-bot: Revert "mariadb: depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480726 (owner: 10Banyek) [11:34:26] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1084 after schema change - T85757 (duration: 00m 51s) [11:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:33] !log depooling db1091 for schema change - T85757 [11:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:36] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:37:43] (03PS2) 10Banyek: mariadb: depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479638 (https://phabricator.wikimedia.org/T85757) [11:38:04] (03CR) 10Banyek: [C: 03+2] mariadb: depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479638 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [11:39:18] (03Merged) 10jenkins-bot: mariadb: depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479638 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [11:41:04] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1091 for schema change - T85757 (duration: 00m 52s) [11:41:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:48] (03PS2) 10Elukey: Add two new HDFS journalnodes to the Analytics Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/478623 (https://phabricator.wikimedia.org/T209929) [11:44:55] (03CR) 10jenkins-bot: mariadb: depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479638 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [11:46:01] !log repooling db1091 after schema change - T85757 [11:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:04] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:46:12] (03PS1) 10Banyek: Revert "mariadb: depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480727 [11:47:29] (03CR) 10Banyek: [C: 03+2] Revert "mariadb: depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480727 (owner: 10Banyek) [11:48:33] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480727 (owner: 10Banyek) [11:49:45] !log rebooting matomo1001 to pick up SSBD-enabled qemu [11:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:06] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1091 after schema change - T85757 (duration: 00m 52s) [11:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:46] (03PS1) 10Ema: cache_canary LVS service [puppet] - 10https://gerrit.wikimedia.org/r/480728 (https://phabricator.wikimedia.org/T202966) [11:55:47] (03PS9) 10Alexandros Kosiaris: Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [11:57:37] (03CR) 10jenkins-bot: Revert "mariadb: depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480727 (owner: 10Banyek) [11:57:45] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Drop valid_tag and tag_summary from labs 
replicas [puppet] - 10https://gerrit.wikimedia.org/r/480590 (https://phabricator.wikimedia.org/T212254) (owner: 10Ladsgroup) [11:58:25] !log rebooting netmon2001 for kernel security update [11:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:42] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:59:15] I'll check es1007 [11:59:53] (1017) [11:59:57] (03PS2) 10Ema: cache_canary LVS service [puppet] - 10https://gerrit.wikimedia.org/r/480728 (https://phabricator.wikimedia.org/T202966) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T1200). [12:00:04] mdholloway: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:21] o/ [12:00:31] I can SWAT today [12:01:06] mdholloway: the two patches you have scheduled should be deployed one by one? or together? [12:01:14] zeljkof: together [12:01:18] (03CR) 10Alexandros Kosiaris: "Thanks. I think I 've addressed both comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [12:01:23] I got confused with "revert revert revert" :) [12:01:28] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:01:43] zeljkof: yeah, sorry, wasn't sure of the best way to go about that [12:01:47] first that one, then the next [12:01:55] (but together) [12:02:07] mdholloway: can you combine then in one commit? or should it be two commit? 
[12:02:21] zeljkof: it could go into a single commit [12:02:46] i'll do that quickly [12:02:59] mdholloway: I would prefer a single commit, you can still reference that it partially reverts something in the commit message [12:04:08] zeljkof: actually, i believe the other one should work on its own, i'll just abandon the "revert revert revert" patch [12:04:57] mdholloway: ok, so I should just merge and deploy 480687? [12:05:06] zeljkof: yes, please [12:05:44] mdholloway: ok, CI might take 10-20 minutes, I'll let you know when it's ready for testing at mwdebug1002 [12:05:55] zeljkof: sounds good, i'll be here [12:06:08] !log rearmed keyholder after netmon2001 reboot [12:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:14] mdholloway: CI was faster than expected, 480687 is merged :) it will be at mwdebug in a few minutes [12:17:09] zeljkof: great! [12:19:23] mdholloway: it's at mwdebug1002, please test and let me know if I can deploy it [12:19:32] zeljkof: ok, checking [12:20:18] zeljkof: looks good! 
[12:21:29] (confirmed that https://test.wikipedia.org/w/api.php?format=json&formatversion=2&action=query&titles=Mavetuna&prop=mapdata&mpdlimit=max&mpdgroups=_51b13e17bbf0535a422e47419d202fb1f632b849 now responds successfully) [12:21:42] (on mwdebug1002) [12:22:07] mdholloway: ok, deploying [12:23:05] !log zfilipin@deploy1001 Synchronized php-1.33.0-wmf.9/extensions/Kartographer/: SWAT: [[gerrit:480687|Fix using at-ease functions in namespaced class (T212218)]] (duration: 00m 53s) [12:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:08] T212218: Fatal error: Call to undefined function Kartographer\Wikimedia\suppressWarnings() in /srv/mediawiki/php-1.33.0-wmf.9/extensions/Kartographer/includes/ApiQueryMapData.php on line 49 - https://phabricator.wikimedia.org/T212218 [12:23:13] mdholloway: it's deployed [12:23:41] (03PS10) 10Alexandros Kosiaris: Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [12:23:54] (03PS1) 10Fsero: Introducing DNS entries for new docker-registries [dns] - 10https://gerrit.wikimedia.org/r/480732 (https://phabricator.wikimedia.org/T212212) [12:23:55] zeljkof: great, thank you! i'll monitor this for a bit and then resolve the related tasks [12:24:07] (03CR) 10jerkins-bot: [V: 04-1] Introducing DNS entries for new docker-registries [dns] - 10https://gerrit.wikimedia.org/r/480732 (https://phabricator.wikimedia.org/T212212) (owner: 10Fsero) [12:24:23] mdholloway: I was just about to ask if it resolves any/both Kartographer related train blockers [12:24:39] !log EU SWAT finished [12:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:41] zeljkof: well, just to confirm: is wmf.8 actually blocked on that error not appearing? 
because it's not a new error (see T184128) [12:26:42] T184128: "PHP Warning: data error" from gzdecode() in ApiGraph.php and ApiQueryMapData.php - https://phabricator.wikimedia.org/T184128 [12:27:22] there was the initial attempt at suppressing the errors which made it into wmf.9 but was never backported into wmf.8 [12:27:44] and now the fix to that initial attempt which was just SWATted into wmf.9 [12:28:14] mdholloway: wmf.9 is blocked (not going forward) until there are any open sub-tasks [12:28:32] if you think a task is not a blocker, feel free to comment in phab and remove from train sub-tasks [12:28:34] (03PS11) 10Alexandros Kosiaris: Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [12:28:55] zeljkof: understood, i'll comment on-task. [12:28:57] it's hard for me to know all the details, I'm just trying to deploy new code and not break anything :) [12:31:41] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Uh, how did this happen? Which patch added this "s"?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) (owner: 10Lucas Werkmeister (WMDE)) [12:32:59] (03CR) 10Mobrovac: [C: 03+1] Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [12:34:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] Configure cxserver ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/480481 (owner: 10KartikMistry) [12:35:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10User-jijiki: Requesting access to deployment for Christoph Jauera (WMDE-Fisch) - https://phabricator.wikimedia.org/T211014 (10WMDE-Fisch) >>! In T211014#4812240, @Dzahn wrote: > @WMDE-Fisch Your a... 
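Side note on the mapdata check quoted at 12:21:29: a minimal sketch, assuming nothing beyond the URL pasted in the channel, of how that query string could be assembled from named parameters so the same check is easy to repeat with another title or group. The helper name `build_mapdata_url` is hypothetical, and the sketch only builds the URL; it does not send a request or route it to mwdebug1002.

```python
from urllib.parse import urlencode

# Endpoint and parameter values copied from the URL quoted in the log.
API_ENDPOINT = "https://test.wikipedia.org/w/api.php"


def build_mapdata_url(title, group, endpoint=API_ENDPOINT):
    """Return the action=query URL for prop=mapdata (hypothetical helper)."""
    params = {
        "format": "json",
        "formatversion": 2,
        "action": "query",
        "titles": title,
        "prop": "mapdata",
        "mpdlimit": "max",
        "mpdgroups": group,
    }
    # urlencode keeps the dict's insertion order and escapes values as needed.
    return endpoint + "?" + urlencode(params)


url = build_mapdata_url("Mavetuna", "_51b13e17bbf0535a422e47419d202fb1f632b849")
print(url)
```

Keeping the parameters in a dict rather than a hand-written string avoids escaping mistakes when a title contains spaces or non-ASCII characters.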
[12:37:19] (03PS8) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 [12:37:22] (03CR) 10Paladox: ircecho: Convert script to python3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [12:38:45] (03PS2) 10Fsero: Introducing DNS entries for new docker-registries [dns] - 10https://gerrit.wikimedia.org/r/480732 (https://phabricator.wikimedia.org/T212212) [12:46:40] !log Drop image_comment_temp on s3 - T209591 [12:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:43] T209591: Drop table image_comment_temp on all wikis - https://phabricator.wikimedia.org/T209591 [12:58:47] (03CR) 10Ema: Clarify expected format of service name in wmf-auto-restart (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480520 (https://phabricator.wikimedia.org/T212219) (owner: 10Muehlenhoff) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T1300) [13:08:46] (03PS1) 10Elukey: druid: reserve middlemanager ports from 8200 onward [puppet] - 10https://gerrit.wikimedia.org/r/480733 (https://phabricator.wikimedia.org/T204979) [13:09:11] (03PS9) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 [13:26:33] (03PS1) 10Volans: zone_validator: handle fully qualified pointers [dns] - 10https://gerrit.wikimedia.org/r/480743 [13:31:06] !log Drop image_comment_temp on s4 - T209591 [13:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:09] T209591: Drop table image_comment_temp on all wikis - https://phabricator.wikimedia.org/T209591 [13:32:59] (03CR) 10BBlack: [C: 03+1] "Tested with the offending commit, works!" 
[dns] - 10https://gerrit.wikimedia.org/r/480743 (owner: 10Volans) [13:34:31] (03CR) 10Volans: [C: 03+2] zone_validator: handle fully qualified pointers [dns] - 10https://gerrit.wikimedia.org/r/480743 (owner: 10Volans) [13:35:44] !log Rename table valid_tag on db1081 (s1) - T212254 [13:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:47] T212254: Drop valid_tag table - https://phabricator.wikimedia.org/T212254 [13:36:47] !log Correction from the previous !log: Rename table valid_tag on db1089 (s1) - T212254 [13:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:10] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [13:45:20] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:46:44] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [13:47:54] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [13:57:53] !log installing nodejs updates on wtp* [13:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] zeljkof: (Dis)respected human, time to deploy MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T1400). Please do the needful. 
[14:00:47] o/ [14:02:06] Train blocked on T212217 [14:02:07] T212217: ErrorException from line 317 of /srv/mediawiki/php-1.33.0-wmf.9/extensions/ExtensionDistributor/includes/specials/SpecialBaseDistributor.php: PHP Notice: Undefined variable: downloadImg - https://phabricator.wikimedia.org/T212217 [14:03:14] !log draining restbase1007 for eventual reboot for kernel security update [14:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:50] (03CR) 10Elukey: [C: 03+1] Remove references to the old, decommissioned etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/479178 (owner: 10Giuseppe Lavagetto) [14:16:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove references to the old, decommissioned etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/479178 (owner: 10Giuseppe Lavagetto) [14:16:05] (03PS3) 10Giuseppe Lavagetto: Remove references to the old, decommissioned etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/479178 [14:16:21] !log draining restbase1008 for eventual reboot for kernel security update [14:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:00] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Marostegui) >>! In T200297#4829689, @awight wrote: > Here are some example queries to help with reviewing the DDL. @Marostegui,... 
[14:27:30] (03PS1) 10Banyek: mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 [14:28:28] (03CR) 10jerkins-bot: [V: 04-1] mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 (owner: 10Banyek) [14:31:03] (03CR) 10Ema: [C: 03+1] interactive: check TTY in ask_confirmation() [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:32:39] (03PS2) 10Banyek: mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 [14:33:33] (03CR) 10jerkins-bot: [V: 04-1] mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 (owner: 10Banyek) [14:33:35] (03PS5) 10Volans: README: move API documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) [14:33:44] (03CR) 10Ottomata: [WIP] Add remaining kerberos wrapped commands (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [14:34:08] (03CR) 10Volans: [C: 03+2] interactive: check TTY in ask_confirmation() [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:34:20] (03CR) 10Ottomata: [C: 03+1] Add two new HDFS journalnodes to the Analytics Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/478623 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [14:34:50] (03CR) 10Ottomata: [C: 03+1] "Oh ho!" 
[puppet] - 10https://gerrit.wikimedia.org/r/480733 (https://phabricator.wikimedia.org/T204979) (owner: 10Elukey) [14:35:37] (03PS3) 10Banyek: mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 [14:36:33] (03CR) 10jerkins-bot: [V: 04-1] mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 (owner: 10Banyek) [14:38:55] (03Merged) 10jenkins-bot: interactive: check TTY in ask_confirmation() [software/spicerack] - 10https://gerrit.wikimedia.org/r/480485 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:39:10] (03PS4) 10Banyek: mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 [14:39:17] (03CR) 10Ema: [C: 03+1] doc: add documentation and its generation [software/spicerack] - 10https://gerrit.wikimedia.org/r/480724 (https://phabricator.wikimedia.org/T205894) (owner: 10Volans) [14:40:05] (03CR) 10jerkins-bot: [V: 04-1] mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 (owner: 10Banyek) [14:40:17] (03CR) 10Ema: [C: 03+1] sre.hosts: add upgrade and reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [14:40:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] Introducing DNS entries for new docker-registries [dns] - 10https://gerrit.wikimedia.org/r/480732 (https://phabricator.wikimedia.org/T212212) (owner: 10Fsero) [14:41:05] (03PS2) 10Volans: doc: add documentation and its generation [software/spicerack] - 10https://gerrit.wikimedia.org/r/480724 (https://phabricator.wikimedia.org/T205894) [14:42:24] !log draining restbase1009 for eventual reboot for kernel security update [14:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:16] (03PS6) 10Volans: README: move API documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) [14:43:21] (03CR) 10Elukey: "Good suggestions, let's 
also hear what is Moritz's preference :)" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [14:44:52] (03CR) 10Giuseppe Lavagetto: role::beta: introduce docker_services (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/478637 (owner: 10Giuseppe Lavagetto) [14:45:25] (03CR) 10Muehlenhoff: "Agreed on moving this to the kerberos module" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [14:46:56] (03CR) 10Volans: [C: 03+2] doc: add documentation and its generation [software/spicerack] - 10https://gerrit.wikimedia.org/r/480724 (https://phabricator.wikimedia.org/T205894) (owner: 10Volans) [14:47:06] (03PS5) 10Banyek: mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 [14:49:03] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Final comments on my end" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [14:49:41] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) Ran Timo's grafana audit script to find dashboards using remaining varnish statsd metrics, note some hits can be false positives (i.e. the... 
[14:52:24] (03Merged) 10jenkins-bot: doc: add documentation and its generation [software/spicerack] - 10https://gerrit.wikimedia.org/r/480724 (https://phabricator.wikimedia.org/T205894) (owner: 10Volans) [14:53:50] (03PS10) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 [14:54:08] (03CR) 10Paladox: ircecho: Convert script to python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [14:55:29] (03PS6) 10Banyek: mariadb: refactor multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/480750 [14:58:43] !log draining restbase1010 for eventual reboot for kernel security update [14:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:03] (03CR) 10Alexandros Kosiaris: [C: 04-1] "nice! Comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/478637 (owner: 10Giuseppe Lavagetto) [15:02:07] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Marostegui) >>! In T200297#4829689, @awight wrote: > This is a recentchanges query which filters on the same field, so only show... [15:02:41] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) After brainstorming this more, since Nginx TLS termination is going to remain for the foreseeable future, even after we move backe... [15:03:01] (03CR) 10Alexandros Kosiaris: "missed one." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [15:03:15] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) p:05Normal→03Low [15:03:24] (03PS11) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 [15:03:45] (03CR) 10Paladox: ircecho: Convert script to python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [15:07:29] !log Drop image_comment_temp from labswiki and labtestwiki - T209591 [15:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:32] T209591: Drop table image_comment_temp on all wikis - https://phabricator.wikimedia.org/T209591 [15:09:21] (03PS1) 10Rush: Revert "stat: add exfat for temporary narrow and approved workflow" [puppet] - 10https://gerrit.wikimedia.org/r/480756 [15:10:56] (03Abandoned) 10Rush: Revert "stat: add exfat for temporary narrow and approved workflow" [puppet] - 10https://gerrit.wikimedia.org/r/480756 (owner: 10Rush) [15:11:09] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/480757 (https://phabricator.wikimedia.org/T205884) [15:13:11] !log draining restbase1011 for eventual reboot for kernel security update [15:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:49] (03PS12) 10Alexandros Kosiaris: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [15:14:55] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [15:15:07] (03CR) 10Paladox: "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [15:16:15] akosiaris: I guess icinga-wm quitting is you upgrading it :) [15:17:49] 
hmm [15:17:58] not there yet, actually doing it [15:18:05] maybe puppet was just faster [15:18:21] then not a good sign, it didn't reconnect :) [15:18:35] we can always rollback :-) [15:19:38] ah there it is [15:19:45] and my puppet run actually did it [15:19:51] so it died on its own? [15:20:27] ImportError: No module named 'pyinotify' [15:20:40] yeah... a race? [15:20:41] (03CR) 10jenkins-bot: doc: add documentation and its generation [software/spicerack] - 10https://gerrit.wikimedia.org/r/480724 (https://phabricator.wikimedia.org/T205894) (owner: 10Volans) [15:20:43] it's running fine now [15:20:44] (03PS1) 10Rush: stat: absent the temp exfat packages [puppet] - 10https://gerrit.wikimedia.org/r/480759 (https://phabricator.wikimedia.org/T211327) [15:20:59] I think puppet was running and fetching stuff the moment I merged the change [15:21:13] that would explain it nicely [15:21:20] (03CR) 10jerkins-bot: [V: 04-1] stat: absent the temp exfat packages [puppet] - 10https://gerrit.wikimedia.org/r/480759 (https://phabricator.wikimedia.org/T211327) (owner: 10Rush) [15:21:28] the python3 version was fetched, but the packages were not yet installed [15:21:28] the diffs in syslog are from Dec 19 15:15:36 [15:21:50] I am almost certain now it's a race [15:22:07] getting the new executable but not the new catalog [15:22:17] yeah could be [15:22:25] one more reason I have source => [15:22:28] I hate* [15:22:47] (03PS2) 10Rush: stat: absent the temp exfat packages [puppet] - 10https://gerrit.wikimedia.org/r/480759 (https://phabricator.wikimedia.org/T211327) [15:22:59] I used to think it was cool, but now I prefer content=> almost always [15:23:05] (03CR) 10Banyek: "https://puppet-compiler.wmflabs.org/compiler1002/14013/" [puppet] - 10https://gerrit.wikimedia.org/r/480750 (owner: 10Banyek) [15:23:12] anyway icinga-wm upgraded to python3 [15:23:15] thanks paladox! [15:23:20] you're welcome :) [15:23:26] does it have yesterday's fix?
[15:23:26] (03CR) 10jerkins-bot: [V: 04-1] stat: absent the temp exfat packages [puppet] - 10https://gerrit.wikimedia.org/r/480759 (https://phabricator.wikimedia.org/T211327) (owner: 10Rush) [15:24:13] volans i think it does (didn't have a merge conflict when i rebased) [15:24:27] ack [15:24:29] thx [15:24:40] (03PS2) 10Bstorm: Drop valid_tag and tag_summary from labs replicas [puppet] - 10https://gerrit.wikimedia.org/r/480590 (https://phabricator.wikimedia.org/T212254) (owner: 10Ladsgroup) [15:24:50] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/480757 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:25:02] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) 17:00UTC tomorrow works for me. [15:25:19] RECOVERY - Device not healthy -SMART- on stat1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad%2520prometheus%252Fops [15:26:23] PROBLEM - puppet last run on ping1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:27:00] (03PS1) 10Paladox: ircecho: Migrate from OptionParser to ArgumentParser [puppet] - 10https://gerrit.wikimedia.org/r/480760 [15:27:49] (03PS1) 10Rush: stat: add exfat for temporary narrow and approved workflow [puppet] - 10https://gerrit.wikimedia.org/r/480761 (https://phabricator.wikimedia.org/T211327) [15:28:16] (03CR) 10Muehlenhoff: stat: absent the temp exfat packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480759 (https://phabricator.wikimedia.org/T211327) (owner: 10Rush) [15:30:33] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/480757 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:30:39] (03CR) 10Andrew Bogott: [C: 03+1] stat: add exfat for temporary narrow and approved workflow [puppet] - 10https://gerrit.wikimedia.org/r/480761 (https://phabricator.wikimedia.org/T211327) (owner: 10Rush) [15:31:32] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/480757 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:31:35] (03PS2) 10Rush: stat: add exfat for temporary narrow and approved workflow [puppet] - 10https://gerrit.wikimedia.org/r/480761 (https://phabricator.wikimedia.org/T211327) [15:32:31] (03CR) 10Rush: [C: 03+2] stat: add exfat for temporary narrow and approved workflow [puppet] - 10https://gerrit.wikimedia.org/r/480761 (https://phabricator.wikimedia.org/T211327) (owner: 10Rush) [15:32:32] 10Operations, 10ops-eqiad: restbase1011 fails to boot, ASSERT error lines - https://phabricator.wikimedia.org/T212305 (10MoritzMuehlenhoff) [15:33:50] (03PS3) 10Bstorm: Drop valid_tag and tag_summary from labs replicas [puppet] - 10https://gerrit.wikimedia.org/r/480590 (https://phabricator.wikimedia.org/T212254) (owner: 10Ladsgroup) [15:33:52] !log labstore1007 mount /dev/sde /mnt/T211327 [15:33:53] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:20] (03CR) 10Volans: [C: 03+2] README: move API documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:35:04] (03CR) 10Bstorm: [C: 03+2] Drop valid_tag and tag_summary from labs replicas [puppet] - 10https://gerrit.wikimedia.org/r/480590 (https://phabricator.wikimedia.org/T212254) (owner: 10Ladsgroup) [15:36:22] (03Merged) 10jenkins-bot: README: move API documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:37:37] (03CR) 10Cwhite: [C: 03+1] doc: set cluster and notification groups [puppet] - 10https://gerrit.wikimedia.org/r/480715 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [15:38:17] (03PS1) 10Tulsi Bhagat: Add 'suppressredirect' user right to patroller user group at zh.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480768 [15:39:06] (03PS1) 10Hashar: contint: instances are fully on eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/480769 (https://phabricator.wikimedia.org/T210288) [15:39:09] !log various cp jobs on labstore1007 to ext media [15:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:11] (03CR) 10Cwhite: [C: 03+1] Remove Diamond from remaining DB roles [puppet] - 10https://gerrit.wikimedia.org/r/480710 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [15:40:36] (03PS2) 10Volans: API: convert to new Spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/479463 (https://phabricator.wikimedia.org/T205884) [15:41:48] (03PS2) 10Muehlenhoff: Remove Diamond from remaining DB roles [puppet] - 10https://gerrit.wikimedia.org/r/480710 (https://phabricator.wikimedia.org/T212231) [15:42:47] moritzm: are you about to merge this? 
[15:42:58] * volans pick up the next free slot in the puppet-merge queue [15:43:17] (03CR) 10Muehlenhoff: [C: 03+2] Remove Diamond from remaining DB roles [puppet] - 10https://gerrit.wikimedia.org/r/480710 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [15:43:52] (03PS2) 10Tulsi Bhagat: Add 'suppressredirect' user right to patroller user group at zh.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480768 (https://phabricator.wikimedia.org/T212272) [15:44:19] bstorm_: ok to merge your patch along? "Drop valid_tag and tag_summary from labs replicas" [15:44:27] volans: ack, once ^ is done [15:44:27] Yes, please [15:44:29] ok [15:45:04] volans: go ahead [15:45:22] moritzm: ack thanks [15:45:50] (03PS2) 10Volans: contint: instances are fully on eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/480769 (https://phabricator.wikimedia.org/T210288) (owner: 10Hashar) [15:46:31] (03CR) 10Hashar: [C: 04-1] "But it does not work:" [puppet] - 10https://gerrit.wikimedia.org/r/480769 (https://phabricator.wikimedia.org/T210288) (owner: 10Hashar) [15:46:42] (03PS1) 10Addshore: Wikibase: prepare to set $wgWBRepoSettings['idGenerator'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480774 (https://phabricator.wikimedia.org/T194299) [15:46:46] (03PS1) 10Addshore: BETA: Wikibase, use mysql-upsert id generator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480775 (https://phabricator.wikimedia.org/T194299) [15:47:07] (03PS1) 10Ayounsi: Depool codfw for row D recabling [dns] - 10https://gerrit.wikimedia.org/r/480776 (https://phabricator.wikimedia.org/T210467) [15:47:28] hashar: mmmh what fails? 
I can have a look in a bit, not immediately [15:47:37] but if you tell me that it doesn't work I'll wait to merge it [15:47:45] wanted to return the favour merging your stuff :) [15:48:07] (03CR) 10Ayounsi: [C: 03+2] Depool codfw for row D recabling [dns] - 10https://gerrit.wikimedia.org/r/480776 (https://phabricator.wikimedia.org/T210467) (owner: 10Ayounsi) [15:48:09] (03PS1) 10Addshore: Wikibase, use mysql-upsert id generator on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480777 (https://phabricator.wikimedia.org/T194299) [15:48:10] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review, 10User-jijiki: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10Papaul) fpc2-fpc8 xe-2/0/41 and xe-2/0/42 fpc7-fpc8 xe-7/0/43 and xe-7/0/44 [15:48:40] !log depool codfw - T210467 [15:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:43] T210467: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 [15:49:03] (03PS1) 10Addshore: Wikibase, use mysql-upsert on all Wikibases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480778 (https://phabricator.wikimedia.org/T194299) [15:49:30] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review, 10User-jijiki: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10Papaul) [15:49:42] (03CR) 10Hashar: [C: 04-1] "Sorry the utils/hiera_lookup queries the labs puppetmaster instead of the local puppet master :/" [puppet] - 10https://gerrit.wikimedia.org/r/480769 (https://phabricator.wikimedia.org/T210288) (owner: 10Hashar) [15:56:11] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480768 (https://phabricator.wikimedia.org/T212272) (owner: 10Tulsi Bhagat) [15:56:44] ACKNOWLEDGEMENT - Host restbase1011 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T212305 [15:57:17] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is 
CRITICAL: 56.3 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:57:31] RECOVERY - puppet last run on ping1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:57:51] zeljkof: did the train all go smoothly? :) [15:58:09] 10Operations, 10Cloud-Services, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10GTirloni) The error I mentioned above is unrelated to this issue, please ignore. It's been fixed in T211202. [15:58:26] addshore: no, there's still one more blocker, train is stopped :/ [15:58:27] (03PS1) 10Mobrovac: CXServer: Pass the IP address to the service's config [puppet] - 10https://gerrit.wikimedia.org/r/480781 [15:58:41] (03PS1) 10Ayounsi: Redirect eqsin/ulsfo caches to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/480782 (https://phabricator.wikimedia.org/T210467) [15:58:51] zeljkof: ack [15:59:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] CXServer: Pass the IP address to the service's config [puppet] - 10https://gerrit.wikimedia.org/r/480781 (owner: 10Mobrovac) [16:00:14] zeljkof: it looks like it is merged on master so https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ExtensionDistributor/+/480578/ can probably just be merged on the branch!
[16:01:33] !log installing php5 security updates on jessie [16:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:03] (03CR) 10Ayounsi: [C: 03+2] Redirect eqsin/ulsfo caches to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/480782 (https://phabricator.wikimedia.org/T210467) (owner: 10Ayounsi) [16:02:11] (03PS2) 10Ayounsi: Redirect eqsin/ulsfo caches to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/480782 (https://phabricator.wikimedia.org/T210467) [16:02:28] 10Operations, 10Traffic: varnishreqstats sends truncated statsd traffic - https://phabricator.wikimedia.org/T212310 (10fgiunchedi) [16:02:37] PROBLEM - Juniper virtual chassis ports on asw-d-codfw is CRITICAL: CRIT: Down: 3 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:04:17] expected ^ [16:04:23] !log Redirect eqsin/ulsfo caches to eqiad - T210467 [16:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:26] T210467: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 [16:06:53] 10Operations, 10ops-codfw, 10DBA: Upgrade db2057 firmware - https://phabricator.wikimedia.org/T212277 (10Papaul) a:05Papaul→03Marostegui Firmware upgrade complete [16:08:27] 10Operations, 10ops-codfw, 10DBA: Upgrade db2057 firmware - https://phabricator.wikimedia.org/T212277 (10Marostegui) Thank you - I will take it from here! 
[16:10:17] (03PS2) 10Paladox: ircecho: Migrate from OptionParser to ArgumentParser [puppet] - 10https://gerrit.wikimedia.org/r/480760 [16:10:47] (03PS3) 10Paladox: ircecho: Migrate from OptionParser to ArgumentParser [puppet] - 10https://gerrit.wikimedia.org/r/480760 [16:11:29] (03CR) 10jerkins-bot: [V: 04-1] ircecho: Migrate from OptionParser to ArgumentParser [puppet] - 10https://gerrit.wikimedia.org/r/480760 (owner: 10Paladox) [16:13:37] (03PS4) 10Paladox: ircecho: Migrate from OptionParser to ArgumentParser [puppet] - 10https://gerrit.wikimedia.org/r/480760 [16:15:29] (03PS5) 10Paladox: ircecho: Migrate from OptionParser to ArgumentParser [puppet] - 10https://gerrit.wikimedia.org/r/480760 [16:16:12] mobrovac: ipaddress: 127.0.0.1 in vars.yaml is fine at, https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/deploy/+/479589/ ? [16:16:14] (03PS1) 10Cmjohnson: Adding mgmt dns cloudvirt1025-30 [dns] - 10https://gerrit.wikimedia.org/r/480786 (https://phabricator.wikimedia.org/T209616) [16:16:33] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review, 10User-jijiki: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10ayounsi) [16:16:40] (03PS1) 10Herron: logstash::collector: pull logs from both kafka-logging clusters [puppet] - 10https://gerrit.wikimedia.org/r/480787 (https://phabricator.wikimedia.org/T205849) [16:17:41] (03CR) 10jerkins-bot: [V: 04-1] logstash::collector: pull logs from both kafka-logging clusters [puppet] - 10https://gerrit.wikimedia.org/r/480787 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) [16:18:04] 10Operations, 10Traffic, 10monitoring: prometheus-based graph significantly slower than statsd equivalent - https://phabricator.wikimedia.org/T212312 (10ema) [16:18:16] ah. 
pressed enter :) [16:18:17] 10Operations, 10Traffic, 10monitoring: prometheus-based graph significantly slower than statsd equivalent - https://phabricator.wikimedia.org/T212312 (10ema) p:05Triage→03Normal [16:18:39] (03PS1) 10Anomie: Set comment migration stage to new everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480788 (https://phabricator.wikimedia.org/T166733) [16:19:07] addshore: probably, I really don't like to self-merge code I know nothing about :/ [16:19:27] (03CR) 10Anomie: [C: 03+2] "Deploying config change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480788 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:19:43] (03PS2) 10Herron: logstash::collector: pull logs from both kafka-logging clusters [puppet] - 10https://gerrit.wikimedia.org/r/480787 (https://phabricator.wikimedia.org/T205849) [16:20:01] (03CR) 10Andrew Bogott: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/480786 (https://phabricator.wikimedia.org/T209616) (owner: 10Cmjohnson) [16:20:08] (03CR) 10Ottomata: [C: 03+1] "I think moving using this from kerberos module is aesthetically correct, but has a bunch of other worms in the can (submodules, wmf puppet" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [16:20:37] (03Merged) 10jenkins-bot: Set comment migration stage to new everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480788 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:20:42] (03PS1) 10Paladox: ircecho: Drop sysvinit support [puppet] - 10https://gerrit.wikimedia.org/r/480789 [16:21:48] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting comment migration to new on group 2 (T166733) (duration: 00m 52s) [16:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:51] T166733: Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733 [16:23:22] (03PS2) 10Paladox: ircecho: Drop sysvinit support 
[puppet] - 10https://gerrit.wikimedia.org/r/480789 [16:24:00] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/480789 (owner: 10Paladox) [16:26:56] (03CR) 10jenkins-bot: Set comment migration stage to new everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480788 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:29:01] (03PS14) 10Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 [16:29:21] RECOVERY - Juniper virtual chassis ports on asw-d-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:29:25] (03PS1) 10Herron: add default vlaue for kafka_shipper::kafka_brokers [puppet] - 10https://gerrit.wikimedia.org/r/480790 [16:30:14] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/14015/" [puppet] - 10https://gerrit.wikimedia.org/r/480787 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) [16:33:48] !log shutdown asw-d4-codfw - T210467 [16:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:51] T210467: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 [16:39:07] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review, 10User-jijiki: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10ayounsi) [16:39:34] bblack: note that row D recabling caused 0 issue [16:39:48] !log swapping disk in slot 2 on db1072 [16:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:14] (03CR) 10Dzahn: [C: 03+2] doc: set cluster and notification groups [puppet] - 10https://gerrit.wikimedia.org/r/480715 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [16:41:22] (03PS2) 10Dzahn: doc: set cluster and notification groups [puppet] - 10https://gerrit.wikimedia.org/r/480715 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [16:43:01] PROBLEM - IPsec on 
mc1033 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2033_v4 [16:43:07] (03PS1) 10Herron: logstash::collector add input identifier tags [puppet] - 10https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) [16:44:21] PROBLEM - Host lvs2010 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:23] PROBLEM - Host elastic2052 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:29] RECOVERY - Host lvs2010 is UP: PING WARNING - Packet loss = 28%, RTA = 36.12 ms [16:44:31] RECOVERY - Host elastic2052 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [16:46:41] RECOVERY - IPsec on mc1033 is OK: Strongswan OK - 1 ESP OK [16:48:50] deploying a hotfix for ExtensionDistributor [16:49:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Cmjohnson) I know that these say 10G but all 4 nics are standard rj45....granted 2 say 10G and 2 say 1G...kind of confusing. I p...
[16:51:11] (03PS2) 10Reedy: Re-enable EP namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478473 (https://phabricator.wikimedia.org/T211494) [16:51:34] jouncebot: now [16:51:34] No deployments scheduled for the next 0 hour(s) and 8 minute(s) [16:51:38] jouncebot: next [16:51:38] In 0 hour(s) and 8 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T1700) [16:52:40] !log codfw row D maintenance finished without issues - T210467 [16:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:44] T210467: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 [16:53:13] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review, 10User-jijiki: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10ayounsi) [16:53:15] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:55:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Adding mgmt dns cloudvirt1025-30 [dns] - 10https://gerrit.wikimedia.org/r/480786 (https://phabricator.wikimedia.org/T209616) (owner: 10Cmjohnson) [16:55:34] (03CR) 10Fsero: [C: 03+2] Introducing DNS entries for new docker-registries [dns] - 10https://gerrit.wikimedia.org/r/480732 (https://phabricator.wikimedia.org/T212212) (owner: 10Fsero) [16:55:54] (03PS3) 10Fsero: Introducing DNS entries for new docker-registries [dns] - 10https://gerrit.wikimedia.org/r/480732 (https://phabricator.wikimedia.org/T212212) [16:56:02] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans) [16:56:17] 10Operations, 10Operations-Software-Development,
10Goal, 10Patch-For-Review: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans) [16:56:23] (03CR) 10Fsero: [V: 03+2 C: 03+2] Introducing DNS entries for new docker-registries [dns] - 10https://gerrit.wikimedia.org/r/480732 (https://phabricator.wikimedia.org/T212212) (owner: 10Fsero) [16:56:59] deploying the hotfix right now [16:57:36] !log hashar@deploy1001 Synchronized php-1.33.0-wmf.9/extensions/ExtensionDistributor/includes/specials/SpecialBaseDistributor.php: Follow-up f686d348: No need for an tag any more - T212217 (duration: 00m 52s) [16:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:39] T212217: ErrorException from line 317 of /srv/mediawiki/php-1.33.0-wmf.9/extensions/ExtensionDistributor/includes/specials/SpecialBaseDistributor.php: PHP Notice: Undefined variable: downloadImg - https://phabricator.wikimedia.org/T212217 [16:57:50] hashar: =o [16:57:55] am I okay to swat in the next hour? 
[16:58:00] (03PS1) 10Herron: rsyslog::kafka_shipper: set rsyslog MaxMessageSize to 64k [puppet] - 10https://gerrit.wikimedia.org/r/480793 (https://phabricator.wikimedia.org/T205849) [16:58:07] !log DNS: updating wmnet to include new registries T212212 [16:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:10] T212212: eqiad: 1-2 VM requests for docker-registry-beta.wikimedia.org - https://phabricator.wikimedia.org/T212212 [16:58:19] zeljkof: done :) [16:58:45] (03PS2) 10Herron: rsyslog::kafka_shipper: set rsyslog MaxMessageSize to 64k [puppet] - 10https://gerrit.wikimedia.org/r/480793 (https://phabricator.wikimedia.org/T205849) [16:59:21] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [16:59:29] !log deactive BGP sessions to telia on cr1-codfw - T211715 [16:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:32] T211715: Interface errors on cr1-codfw:xe-5/3/1 - https://phabricator.wikimedia.org/T211715 [16:59:55] addshore: I am done deploying [16:59:59] sweeet [17:00:03] jouncebot: next [17:00:04] In 2 hour(s) and 59 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T2000) [17:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T1700). [17:00:04] Addshore: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:07] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 73.43 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [17:00:08] we will do the train later tonight [17:00:10] (03CR) 10Addshore: [C: 03+2] Wikibase: prepare to set $wgWBRepoSettings['idGenerator'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480774 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:00:12] (03CR) 10Addshore: [C: 03+2] BETA: Wikibase, use mysql-upsert id generator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480775 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:00:17] hashar: okay! [17:00:17] using the american slot (20:00 UTC iirc) [17:00:25] the joy of having 2 slots [17:01:22] (03Merged) 10jenkins-bot: Wikibase: prepare to set $wgWBRepoSettings['idGenerator'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480774 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:01:25] (03Merged) 10jenkins-bot: BETA: Wikibase, use mysql-upsert id generator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480775 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:02:34] XioNoX: nice :) [17:04:13] (03CR) 10jenkins-bot: Wikibase: prepare to set $wgWBRepoSettings['idGenerator'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480774 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:04:16] (03CR) 10jenkins-bot: BETA: Wikibase, use mysql-upsert id generator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480775 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:04:40] !log addshore@deploy1001 Synchronized wmf-config: Wikibase: prepare to set $wgWBRepoSettings idGenerator, T194299 (duration: 00m 53s) [17:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:43] T194299: Lock wait timeout exceeded in SqlIdGenerator::generateNewId - 
https://phabricator.wikimedia.org/T194299 [17:05:12] (03PS2) 10Addshore: Wikibase, use mysql-upsert id generator on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480777 (https://phabricator.wikimedia.org/T194299) [17:05:18] !log remove 2nd port to AS8220 (cf. email to peering@) [17:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:24] (03CR) 10Addshore: [C: 03+2] Wikibase, use mysql-upsert id generator on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480777 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:06:27] PROBLEM - Host lvs2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:06:29] (03Merged) 10jenkins-bot: Wikibase, use mysql-upsert id generator on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480777 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:06:33] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:06:45] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:06:55] PROBLEM - Host elastic2051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:06:55] PROBLEM - Host elastic2052.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:06:55] PROBLEM - Host elastic2050.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:06:56] what [17:07:13] papaul: are you doing something with mgmt in row D ? 
[17:07:25] PROBLEM - Host ms-be2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:07:25] PROBLEM - Host ms-be2024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:07:27] PROBLEM - Host ms-be2023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:07:31] PROBLEM - Host ms-be2038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:07:46] note that this is only mgmt so shouldn't be production impacting [17:08:01] PROBLEM - Host ms-be2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:08:35] PROBLEM - Host ms-be2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:08:35] PROBLEM - Host ms-fe2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:08:37] PROBLEM - Host backup2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:08:45] PROBLEM - Host cp2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:09:57] PROBLEM - Host cp2021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:13] PROBLEM - Host cp2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:13] PROBLEM - Host cp2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:13:31] !log addshore@deploy1001 Synchronized wmf-config: Wikibase: testwikidatawiki upsert idGenerator, T194299 (duration: 00m 52s) [17:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:34] T194299: Lock wait timeout exceeded in SqlIdGenerator::generateNewId - https://phabricator.wikimedia.org/T194299 [17:14:27] RECOVERY - Host ps1-d2-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.29 ms [17:14:37] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [17:14:49] RECOVERY - Host cp2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [17:15:06] 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10hashar) The VM is working and the basic service is there ( rsyncd ). 
I will complete the service implementation via... [17:15:14] (03CR) 10Addshore: [C: 03+2] Wikibase, use mysql-upsert on all Wikibases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480778 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:15:15] RECOVERY - Host cp2021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms [17:15:20] (03PS2) 10Addshore: Wikibase, use mysql-upsert on all Wikibases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480778 (https://phabricator.wikimedia.org/T194299) [17:15:23] (03CR) 10Addshore: [C: 03+2] Wikibase, use mysql-upsert on all Wikibases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480778 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:15:31] RECOVERY - Host cp2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.74 ms [17:15:45] RECOVERY - Host ms-be2024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.91 ms [17:16:36] (03Merged) 10jenkins-bot: Wikibase, use mysql-upsert on all Wikibases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480778 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:16:47] (03CR) 10jenkins-bot: Wikibase, use mysql-upsert id generator on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480777 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:16:49] (03CR) 10jenkins-bot: Wikibase, use mysql-upsert on all Wikibases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480778 (https://phabricator.wikimedia.org/T194299) (owner: 10Addshore) [17:17:13] RECOVERY - Host lvs2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [17:17:39] RECOVERY - Host elastic2052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.07 ms [17:17:39] RECOVERY - Host elastic2051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [17:17:39] RECOVERY - Host elastic2050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.94 ms [17:18:09] RECOVERY - Host ms-be2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 
36.67 ms [17:18:11] RECOVERY - Host ms-be2023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.63 ms [17:18:15] RECOVERY - Host ms-be2038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms [17:18:49] RECOVERY - Host ms-be2037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.82 ms [17:19:23] RECOVERY - Host ms-be2043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [17:19:23] RECOVERY - Host ms-fe2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [17:19:25] RECOVERY - Host backup2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.93 ms [17:19:33] RECOVERY - Host cp2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [17:21:03] (03PS2) 10Hashar: doc: grant access to contint-admins to doc1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/480716 (https://phabricator.wikimedia.org/T211974) [17:21:17] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) Ok, email sent. 
[17:22:36] (03PS1) 10Hashar: doc: grant doc-uploader access to contint users [puppet] - 10https://gerrit.wikimedia.org/r/480798 (https://phabricator.wikimedia.org/T211974) [17:22:39] !log addshore@deploy1001 Synchronized wmf-config: Wikibase: wikidatawiki upsert idGenerator, T194299 (duration: 00m 52s) [17:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:42] T194299: Lock wait timeout exceeded in SqlIdGenerator::generateNewId - https://phabricator.wikimedia.org/T194299 [17:22:56] (03CR) 10Dzahn: [C: 03+2] "existing group to a new host for things split off from old host that people already have access to, so not a new access in practical sense" [puppet] - 10https://gerrit.wikimedia.org/r/480716 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [17:23:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:27:11] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:28:14] (03PS1) 10Fsero: Adding registries VMs. [puppet] - 10https://gerrit.wikimedia.org/r/480800 (https://phabricator.wikimedia.org/T212212) [17:30:10] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) Faidon suggested a microservice that runs on PuppetDB host and exports the pertinent fa... [17:30:24] (03CR) 10Fsero: "it's my first time adding a VM and a new node over puppet so I'm probably missing something." 
[puppet] - 10https://gerrit.wikimedia.org/r/480800 (https://phabricator.wikimedia.org/T212212) (owner: 10Fsero) [17:30:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:32:16] !log SWAT done [17:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:22] (03PS1) 10Hashar: doc: add Apache config for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/480802 (https://phabricator.wikimedia.org/T211974) [17:34:17] (03CR) 10jerkins-bot: [V: 04-1] doc: add Apache config for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/480802 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [17:34:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:35:17] (03PS2) 10Hashar: doc: add Apache config for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/480802 (https://phabricator.wikimedia.org/T211974) [17:38:07] (03PS1) 10Hashar: contint: cleanup legacy doc.wikimedia.org apache config [puppet] - 10https://gerrit.wikimedia.org/r/480804 [17:38:43] (03CR) 10Hashar: [C: 04-1] "Pending migration of doc.wikimedia.org. 
This is a cleanup change following https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480802/" [puppet] - 10https://gerrit.wikimedia.org/r/480804 (owner: 10Hashar) [17:39:57] (03PS3) 10Dzahn: doc: add Apache config for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/480802 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [17:40:03] (03CR) 10Dzahn: [C: 03+2] doc: add Apache config for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/480802 (https://phabricator.wikimedia.org/T211974) (owner: 10Hashar) [17:44:19] (03CR) 10Jeena Huneidi: "I am not completely familiar with the review culture yet but since I am listed as the 'maintainer' I guess I should have felt comfortable " [deployment-charts] - 10https://gerrit.wikimedia.org/r/480484 (owner: 10Alexandros Kosiaris) [17:45:13] RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [17:45:40] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4793282, @Marostegui wrote: > What's the expected growth for that table? Once Jade is fully accepted by... [17:46:39] (03PS2) 10Cmjohnson: Adding mgmt dns cloudvirt1025-30 [dns] - 10https://gerrit.wikimedia.org/r/480786 (https://phabricator.wikimedia.org/T209616) [17:47:08] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns cloudvirt1025-30 [dns] - 10https://gerrit.wikimedia.org/r/480786 (https://phabricator.wikimedia.org/T209616) (owner: 10Cmjohnson) [17:58:56] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T212185 (10Cmjohnson) 05Open→03Resolved The disk is back RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy cmjohnson@db1072:~$ sudo megacli -PDList -aALL |grep "Firmware... 
[18:03:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/480252 (https://phabricator.wikimedia.org/T211859) (owner: 10Herron) [18:05:24] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review, 10User-jijiki: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10Papaul) [18:05:48] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review, 10User-jijiki: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10Papaul) a:05Papaul→03ayounsi [18:06:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:07:30] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:07:59] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T212185 (10Marostegui) Thank you!! [18:08:42] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:10:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:18:40] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [18:25:52] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [18:28:49] !log replace `interface-range vlan-private1-b-eqiad member ge-6/0/*` with individual interfaces on asw2-b-eqiad [18:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:37] (03CR) 10Volans: [C: 04-2] "Given that the newer Spicerack package was not deployed today and we decided to postpone it to the first week after the holidays, voting -" [cookbooks] - 10https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [18:29:55] (03CR) 10Volans: [C: 04-2] "Given that the newer Spicerack package was not deployed today and we decided to postpone it to the first week after the holidays, voting -" [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [18:30:15] (03CR) 10Volans: [C: 04-2] "Given that the newer Spicerack package was not deployed today and we decided to postpone it to the first week after the holidays, voting -" [cookbooks] - 10https://gerrit.wikimedia.org/r/479463 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [18:34:55] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans) Current status is: - ~95% 
of the library migrated - documentation done - other wmf-* script partially done (peding... [18:37:26] PROBLEM - SSH db2057.mgmt on db2057.mgmt is CRITICAL: connect to address 10.193.2.145 and port 22: Connection refused [18:37:51] papaul: ^ [18:38:33] marostegui: checking [18:38:45] Thanks! [18:41:24] !log Stop MySQL on db2057 [18:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:07] 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10Dzahn) shell access for existing groups contint-admins and contint-users has been granted (same access people had be... [18:47:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Cmjohnson) [18:50:25] (03PS1) 10CDanis: profile::grafana: install sqlite3 [puppet] - 10https://gerrit.wikimedia.org/r/480811 [18:51:34] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:51:38] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4834319, @Marostegui wrote: > Other than a possible misbehaviour of the optimizer, they look ok to me. W... [18:52:06] (03CR) 10CDanis: [C: 03+2] profile::grafana: install sqlite3 [puppet] - 10https://gerrit.wikimedia.org/r/480811 (owner: 10CDanis) [18:56:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:57:02] PROBLEM - Host db2057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:57:28] (03PS1) 10Cmjohnson: Adding dhcpd/netboot.cfg entries cloudvirt1025-30 [puppet] - 10https://gerrit.wikimedia.org/r/480812 (https://phabricator.wikimedia.org/T209616) [18:58:30] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [19:00:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Cmjohnson) @RobH these are ready for installs I added the mac address and netboot.cfg I did not merge the changes, please revie... [19:00:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Cmjohnson) a:05Cmjohnson→03RobH [19:01:38] (03CR) 10Volans: "Nitpick inline, just FYI, not need to change it I guess." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480811 (owner: 10CDanis) [19:01:52] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time [19:04:32] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [19:05:38] (03CR) 10CDanis: [C: 03+2] profile::grafana: install sqlite3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480811 (owner: 10CDanis) [19:06:40] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.013 second response time [19:07:47] !log cp of files to ext drive on labstore1007 [19:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:50] RECOVERY - Host db2057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [19:08:34] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [19:09:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [19:11:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Cmjohnson) @robh also, the 2nd ethernet port was placed in cloud-virt-instance-trunk [19:11:34] PROBLEM - IPMI Sensor Status on ms-be2048 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [19:14:54] (03CR) 10Volans: profile::grafana: install sqlite3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480811 (owner: 10CDanis) [19:15:48] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:16:33] 10Operations, 10ops-eqiad: restbase1011 fails to boot, ASSERT error lines - https://phabricator.wikimedia.org/T212305 (10Cmjohnson) @MoritzMuehlenhoff did a hard power cycle and the server came up clean, I've never seen the ASSERT messages. Typically if there is a h/w error on HP I will get a yellow notice... 
[19:16:45] (03CR) 10CDanis: [C: 03+2] profile::grafana: install sqlite3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480811 (owner: 10CDanis) [19:17:34] that's fine cdanis, as I said earlied, no prob to keep it as is, was mostly FYI :) [19:18:22] I am still not sure why your way is usually more readable tbh :D [19:18:53] (03PS3) 10Herron: logstash::collector: pin curator to components/spicerack on stretch [puppet] - 10https://gerrit.wikimedia.org/r/480252 (https://phabricator.wikimedia.org/T211859) [19:20:26] (03CR) 10Herron: [C: 03+2] logstash::collector: pin curator to components/spicerack on stretch [puppet] - 10https://gerrit.wikimedia.org/r/480252 (https://phabricator.wikimedia.org/T211859) (owner: 10Herron) [19:23:04] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:28:40] !log reactive BGP sessions to telia on cr1-codfw - T211715 [19:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:42] T211715: Interface errors on cr1-codfw:xe-5/3/1 - https://phabricator.wikimedia.org/T211715 [19:29:16] RECOVERY - SSH db2057.mgmt on db2057.mgmt is OK: SSH OK - mpSSH_0.2.1 (protocol 2.0) [19:30:15] (03PS1) 10Ayounsi: Revert "Depool codfw for row D recabling" [dns] - 10https://gerrit.wikimedia.org/r/480814 [19:30:26] (03CR) 10GTirloni: [C: 04-1] "We need a better strategy for custom container images. 
We can't fall into the same hole we did with Grid Engine hosts where we just instal" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/480159 (https://phabricator.wikimedia.org/T151656) (owner: 10MaxSem) [19:30:28] (03PS1) 10Ayounsi: Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/480815 [19:31:51] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool codfw for row D recabling" [dns] - 10https://gerrit.wikimedia.org/r/480814 (owner: 10Ayounsi) [19:31:56] (03PS2) 10Ayounsi: Revert "Depool codfw for row D recabling" [dns] - 10https://gerrit.wikimedia.org/r/480814 [19:31:59] (03CR) 10Ayounsi: [C: 03+2] Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/480815 (owner: 10Ayounsi) [19:32:01] (03CR) 10MaxSem: "Then please comment on the ticket, we would love to go with whatever route proposed, but we need _some_ route." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/480159 (https://phabricator.wikimedia.org/T151656) (owner: 10MaxSem) [19:32:04] (03PS2) 10Ayounsi: Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/480815 [19:32:11] !log repool codfw - T210467 [19:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:14] T210467: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 [19:33:41] !log Revert "Redirect eqsin/ulsfo caches to eqiad" - T210467 [19:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:11] 10Operations, 10ops-codfw, 10netops: upgrade all codfw switch stacks to include additional 10G switch per row - https://phabricator.wikimedia.org/T196489 (10ayounsi) [19:35:17] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review, 10User-jijiki: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10ayounsi) 05Open→03Resolved This has been completed under 1h with no issues whatsoever. 
[19:35:30] 10Operations, 10ops-codfw, 10netops: upgrade all codfw switch stacks to include additional 10G switch per row - https://phabricator.wikimedia.org/T196489 (10ayounsi) [19:35:44] 10Operations, 10ops-codfw, 10netops: upgrade all codfw switch stacks to include additional 10G switch per row - https://phabricator.wikimedia.org/T196489 (10ayounsi) 05Open→03Resolved All child tasks done. [19:46:31] 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10Dzahn) 05Open→03Resolved Yes, the VM has been created, basic role has been created, users added, httpd installed... [19:47:03] 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10Dzahn) [19:47:19] (03PS1) 10Dzahn: doc: copy new httpd config for doc.wm.org to its own place [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) [19:48:53] 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10hashar) Thank you for the quick spinning of the instance as well as all the preliminary puppet work. 
Much appreciated :) [19:50:20] (03CR) 10Dzahn: "new erb file is just from 'cp modules/contint/templates/apache/doc.wikimedia.org.erb"" [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [19:50:22] (03CR) 10Hashar: "almost :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [19:51:12] (03CR) 10Dzahn: doc: copy new httpd config for doc.wm.org to its own place (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [19:52:16] mutante: my puppet is way too rusty sorry :) [19:52:23] +1 on copying the config [19:52:34] hashar: ok, :) thx [19:52:38] (03CR) 10Hashar: [C: 03+1] "My puppet is rusty :)" [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [19:53:18] (03PS2) 10Dzahn: doc: copy new httpd config for doc.wm.org to its own place [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) [19:53:26] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [19:57:31] (03CR) 10Hashar: [C: 03+1] doc: copy new httpd config for doc.wm.org to its own place (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T2000) [20:00:08] (03PS1) 10Hashar: group1 wikis to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480818 [20:00:10] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480818 (owner: 10Hashar) [20:00:16] just in time [20:00:52] oh, train time [20:01:24] (03Merged) 10jenkins-bot: group1 wikis to 
1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480818 (owner: 10Hashar) [20:01:27] (03CR) 10Legoktm: [C: 03+1] When running scripts from staging, use the CommonSettings.php from staging [puppet] - 10https://gerrit.wikimedia.org/r/480695 (owner: 10Tim Starling) [20:01:36] (03PS3) 10Dzahn: doc: copy new httpd config for doc.wm.org to its own place [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) [20:02:43] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.9 [20:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:57] (03CR) 10Dzahn: doc: copy new httpd config for doc.wm.org to its own place (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [20:03:34] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.9 (duration: 00m 51s) [20:03:35] (03CR) 10Dzahn: [C: 03+2] doc: copy new httpd config for doc.wm.org to its own place [puppet] - 10https://gerrit.wikimedia.org/r/480817 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [20:03:37] (03CR) 10Andrew Bogott: [C: 04-1] "I know this is different from what I originally requested, but let's try imaging these as Stretch -- it might not take but if we can make " [puppet] - 10https://gerrit.wikimedia.org/r/480812 (https://phabricator.wikimedia.org/T209616) (owner: 10Cmjohnson) [20:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:39] 10Operations, 10TechCom, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Milimetric) In https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service we see that "There is a server-side and the client-side variant o... 
[20:07:02] PROBLEM - Nginx local proxy to apache on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:52] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Eevans) >>! In T212129#4834329, @mark wrote: > I am getting the impression here that some thing... [20:08:08] RECOVERY - Nginx local proxy to apache on mw1332 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.044 second response time [20:11:45] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480818 (owner: 10Hashar) [20:15:56] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [20:18:35] ^ mutante [20:22:54] herron: distracted! sry, fixed [20:23:00] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [20:23:12] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[20:23:33] (03PS1) 10Andrew Bogott: Horizon: move more projects to eqiad1-r: [puppet] - 10https://gerrit.wikimedia.org/r/480819 (https://phabricator.wikimedia.org/T204745) [20:24:35] (03PS1) 10Legoktm: planet: Add Farida's blog (Outreachy Intern) [puppet] - 10https://gerrit.wikimedia.org/r/480820 [20:25:06] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: move more projects to eqiad1-r: [puppet] - 10https://gerrit.wikimedia.org/r/480819 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [20:32:40] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [20:33:19] (03PS1) 10Dzahn: doc: httpd, add file handler for .php files -> PHP-FPM [puppet] - 10https://gerrit.wikimedia.org/r/480821 (https://phabricator.wikimedia.org/T137890) [20:34:31] (03PS2) 10Dzahn: doc: httpd, add file handler for .php files -> PHP-FPM [puppet] - 10https://gerrit.wikimedia.org/r/480821 (https://phabricator.wikimedia.org/T137890) [20:38:51] (03PS3) 10Dzahn: doc: httpd, add file handler for .php files -> PHP-FPM [puppet] - 10https://gerrit.wikimedia.org/r/480821 (https://phabricator.wikimedia.org/T137890) [20:44:58] !log 1.33.0-wmf.9 on group1 looks fine. [20:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:27] (03CR) 10Dzahn: [C: 03+2] doc: httpd, add file handler for .php files -> PHP-FPM [puppet] - 10https://gerrit.wikimedia.org/r/480821 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [20:50:34] (03PS4) 10Dzahn: doc: httpd, add file handler for .php files -> PHP-FPM [puppet] - 10https://gerrit.wikimedia.org/r/480821 (https://phabricator.wikimedia.org/T137890) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. 
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181219T2100). [21:01:10] nothing for parsoid [21:03:20] (03PS1) 10Mathew.onipe: cirrus: increase number of shards for enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 [21:04:08] (03PS2) 10Mathew.onipe: cirrus: increase number of shards for enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) [21:05:36] (03CR) 10Framawiki: [C: 03+1] Revert "Block WP Zero users from accessing Phabricator uploads" [puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [21:09:55] (03PS1) 10Dzahn: doc: remove "deprecated user of DefaultType" [puppet] - 10https://gerrit.wikimedia.org/r/480830 [21:12:08] (03PS1) 10Hashar: scap configuration for integration/docroot.git [puppet] - 10https://gerrit.wikimedia.org/r/480832 (https://phabricator.wikimedia.org/T137890) [21:17:22] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [21:17:44] 10Puppet, 10Cloud-VPS: Something is wrong with puppet on the jessie bootstrapvz instance - https://phabricator.wikimedia.org/T212119 (10Andrew) It's been doing this for a while. There's a hacked version of bootstrap-vz running there ( https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_images#Custom... 
[21:17:53] 10Puppet, 10Cloud-VPS: Something is wrong with puppet on the jessie bootstrapvz instance - https://phabricator.wikimedia.org/T212119 (10Andrew) 05Open→03Declined [21:19:15] (03CR) 10Thcipriani: [C: 03+1] "This will setup the dsh group and the repo on deployment hosts, will also need to add scap::target to the doc host so that scap can deploy" [puppet] - 10https://gerrit.wikimedia.org/r/480832 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [21:20:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [21:21:41] (03PS1) 10CDanis: profile::grafana::production: datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) [21:22:15] (03CR) 10jerkins-bot: [V: 04-1] profile::grafana::production: datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) (owner: 10CDanis) [21:22:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [21:23:33] (03PS1) 10Framawiki: Specify allowed ldap groups by site logins [puppet] - 10https://gerrit.wikimedia.org/r/480869 [21:24:14] (03PS2) 10CDanis: profile::grafana::production: datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) [21:25:18] (03CR) 10jerkins-bot: [V: 04-1] profile::grafana::production: datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) (owner: 10CDanis) [21:25:38] (03CR) 10Dzahn: [C: 03+2] ""This directive has no effect other than to emit warnings if the value is not none. 
In prior versions, DefaultType would specify a default" [puppet] - 10https://gerrit.wikimedia.org/r/480830 (owner: 10Dzahn) [21:25:46] (03PS2) 10Dzahn: doc: remove "deprecated user of DefaultType" [puppet] - 10https://gerrit.wikimedia.org/r/480830 [21:25:51] (03PS3) 10CDanis: profile::grafana::production: datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) [21:26:18] (03CR) 10Volans: profile::grafana::production: datasources as YAML (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) (owner: 10CDanis) [21:26:35] (03CR) 10jerkins-bot: [V: 04-1] profile::grafana::production: datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) (owner: 10CDanis) [21:28:14] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [21:31:47] (03CR) 10CDanis: profile::grafana::production: datasources as YAML (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) (owner: 10CDanis) [21:31:59] (03PS2) 10Hashar: scap configuration for integration/docroot.git [puppet] - 10https://gerrit.wikimedia.org/r/480832 (https://phabricator.wikimedia.org/T137890) [21:32:03] (03PS4) 10CDanis: profile::grafana::production: datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) [21:32:57] (03CR) 10jerkins-bot: [V: 04-1] scap configuration for integration/docroot.git [puppet] - 10https://gerrit.wikimedia.org/r/480832 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [21:33:48] (03PS3) 10Dzahn: doc: remove "deprecated user of DefaultType" [puppet] - 10https://gerrit.wikimedia.org/r/480830 
(https://phabricator.wikimedia.org/T137890) [21:33:59] (03CR) 10Dzahn: [C: 03+2] doc: remove "deprecated user of DefaultType" [puppet] - 10https://gerrit.wikimedia.org/r/480830 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [21:41:40] (03PS12) 10BBlack: New zone generator gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/479892 [21:41:42] (03PS1) 10BBlack: deploy-check.py replaces check-gdnsd.sh [dns] - 10https://gerrit.wikimedia.org/r/480870 [21:41:44] (03PS1) 10BBlack: Remove authdns-gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/480871 [21:41:51] (03PS1) 10BBlack: authdns-local-update: use check-deploy.py [puppet] - 10https://gerrit.wikimedia.org/r/480872 [21:41:53] (03PS1) 10BBlack: authdns::scripts: no more python-jinja2 [puppet] - 10https://gerrit.wikimedia.org/r/480873 [21:43:23] (03Abandoned) 10BBlack: [WIP] authdns-local-update: use check-gdnsd/gen-zones [puppet] - 10https://gerrit.wikimedia.org/r/480477 (owner: 10BBlack) [21:44:32] (03CR) 10Dzahn: "all of these are analytics services and files written by ottomata. adding him." [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [21:45:01] (03PS5) 10CDanis: profile::grafana::production: datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) [21:45:45] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Redundancy Policy on this system was set to Not redundant or on the other working system it was set to redundant so we change the settings for this system to redundant as well.... 
[21:52:01] (03PS2) 10BBlack: deploy-check.py replaces check-gdnsd.sh [dns] - 10https://gerrit.wikimedia.org/r/480870 [21:52:04] (03PS2) 10BBlack: Remove authdns-gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/480871 [21:54:52] (03PS3) 10Hashar: scap configuration for integration/docroot.git [puppet] - 10https://gerrit.wikimedia.org/r/480832 (https://phabricator.wikimedia.org/T137890) [21:56:32] (03CR) 10Hashar: "And I have fixed the spec :)" [puppet] - 10https://gerrit.wikimedia.org/r/480832 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [22:05:15] (03CR) 10Volans: profile::grafana::production: datasources as YAML (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) (owner: 10CDanis) [22:05:54] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10ayounsi) p:05Triage→03Normal [22:07:48] (03CR) 10Hashar: "I have no clue how that will behave when material is rsynced under the sub directory org/wikimedia/doc" [puppet] - 10https://gerrit.wikimedia.org/r/480832 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [22:09:43] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ayounsi) [22:09:45] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ayounsi) [22:09:47] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10ayounsi) [22:11:14] hello [22:11:18] anyone around? [22:12:27] OlEnglish: lots of lurkers. Are you having a problem? 
[22:12:52] I keep getting a strange "internal error" and I'm wondering if someone else here can repeat it [22:12:58] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:13:22] https://en.wikipedia.org/wiki/Special:AbuseLog/8229629 [22:13:27] it occurs from that link [22:13:44] (03PS6) 10CDanis: profile::grafana::production: datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) [22:13:45] yep. let me look up the error code in our logging system [22:13:59] thanks [22:14:20] (03CR) 10CDanis: profile::grafana::production: datasources as YAML (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) (owner: 10CDanis) [22:15:14] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [22:15:17] !log scb1003: systemctl restart pdfrender [22:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:22] OlEnglish: have you started a bug report for this in Phabricator yet? [22:15:28] no i haven't [22:15:32] please feel free [22:15:55] *nod* I'll do that and include the stack trace I'm seeing in the logs [22:16:04] thank you [22:17:13] (03CR) 10Volans: [C: 03+1] "LGTM! 
I didn't double check all the datasources, but I know the method you used to generate them and that you'll do a backup just in case " [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) (owner: 10CDanis) [22:20:08] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [22:20:52] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) @kaldari We have deployed librsvg 2.40.20-3 on deployment-imagescaler03 under debian stretch, after some testing it does't look like SVG rendering... [22:26:26] (03Abandoned) 10Hashar: scap configuration for integration/docroot.git [puppet] - 10https://gerrit.wikimedia.org/r/480832 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [22:28:38] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5 [22:33:18] (03CR) 10Dzahn: [C: 03+1] "first i thought this is a hostgroup for all "mgmt" hosts as in DRAC consoles, but i see it means "cumin master" basically, yea, it follows" [puppet] - 10https://gerrit.wikimedia.org/r/480664 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [22:34:03] (03PS2) 10Dzahn: planet: Add Farida's blog (Outreachy Intern) [puppet] - 10https://gerrit.wikimedia.org/r/480820 (owner: 10Legoktm) [22:34:55] (03CR) 10Dzahn: [C: 03+2] planet: Add Farida's blog (Outreachy Intern) [puppet] - 10https://gerrit.wikimedia.org/r/480820 (owner: 10Legoktm) [22:36:16] (03CR) 10CDanis: [C: 03+2] profile::grafana::production: 
datasources as YAML [puppet] - 10https://gerrit.wikimedia.org/r/480833 (https://phabricator.wikimedia.org/T211979) (owner: 10CDanis) [22:36:29] (03PS3) 10Dzahn: planet: Add Farida's blog (Outreachy Intern) [puppet] - 10https://gerrit.wikimedia.org/r/480820 (owner: 10Legoktm) [22:37:05] (03CR) 10Volans: [C: 04-1] "Looks ok, just a missing thing inline and a question." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/480714 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [22:42:50] (03PS1) 10Hashar: doc: clone integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/480879 (https://phabricator.wikimedia.org/T137890) [22:43:27] (03CR) 10jerkins-bot: [V: 04-1] doc: clone integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/480879 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [22:43:27] 10Operations, 10monitoring, 10Patch-For-Review, 10Performance-Team (Radar), 10User-CDanis: Upgrade grafana to 5.x - https://phabricator.wikimedia.org/T210416 (10CDanis) [22:45:32] (03CR) 10Dzahn: "please see Filippo's comment at https://phabricator.wikimedia.org/T78705#4823287 do you want to use the prod logging cluster instead? th" [puppet] - 10https://gerrit.wikimedia.org/r/479567 (https://phabricator.wikimedia.org/T78705) (owner: 10Dduvall) [22:49:52] (03CR) 10Dzahn: [C: 04-1] "i already made a copy of the file for the new location and removed the lines from .fixtures. 
the only difference is i kept it as a templat" [puppet] - 10https://gerrit.wikimedia.org/r/480804 (owner: 10Hashar) [22:50:34] (03PS2) 10Hashar: doc: clone integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/480879 (https://phabricator.wikimedia.org/T137890) [22:50:36] (03PS1) 10Hashar: git: git::clone requires the git package [puppet] - 10https://gerrit.wikimedia.org/r/480880 [22:50:37] 10Operations, 10Jade, 10TechCom, 10Epic, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10awight) [22:51:25] (03CR) 10jerkins-bot: [V: 04-1] git: git::clone requires the git package [puppet] - 10https://gerrit.wikimedia.org/r/480880 (owner: 10Hashar) [22:51:28] (03Abandoned) 10Hashar: contint: cleanup legacy doc.wikimedia.org apache config [puppet] - 10https://gerrit.wikimedia.org/r/480804 (owner: 10Hashar) [22:51:53] I should stop working so late [22:52:55] (03PS2) 10Hashar: git: git::clone requires the git package [puppet] - 10https://gerrit.wikimedia.org/r/480880 [22:54:33] (03PS3) 10Hashar: doc: clone integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/480879 (https://phabricator.wikimedia.org/T137890) [22:55:47] 10Operations, 10ops-codfw: Interface errors on cr1-codfw:xe-5/3/1 - https://phabricator.wikimedia.org/T211715 (10ayounsi) 05Open→03Resolved a:05Papaul→03ayounsi After many back and forth by email the issue has been fixed: > Dear customer, > After having our transmission vendor troubleshooting the DWDM... [23:00:11] (03PS1) 10Hashar: doc: relocate from /srv to /srv/docroot [puppet] - 10https://gerrit.wikimedia.org/r/480881 (https://phabricator.wikimedia.org/T137890) [23:01:54] (03CR) 10Hashar: "That is merely for a profile test which uses git::clone and save us from having to include the whole base module :)" [puppet] - 10https://gerrit.wikimedia.org/r/480880 (owner: 10Hashar) [23:02:16] (03CR) 10Dzahn: "so you do not want automatic cloning of new changes, right? 
then this is good because ensure defaults to "present" and not "latest". maybe" [puppet] - 10https://gerrit.wikimedia.org/r/480879 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [23:03:54] (03CR) 10Hashar: "Note how it is cloned to /srv/docroot , that is so we can own the base directory. /srv is owned by root:root and that thus does not work" [puppet] - 10https://gerrit.wikimedia.org/r/480879 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [23:03:59] mutante: ah yes :) [23:04:11] mutante: git::clone ensure=>present I think that is the safe path for now. [23:04:29] can always make it use ensure => latest later, but I am not sure it is actually wanted [23:04:49] I also made it to be cloned under /srv/docroot since /srv is owned by root:root [23:05:04] and the next patch updates apache/bacula/rsync etc https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480881/ [23:05:20] !log krinkle@deploy1001 Started deploy [performance/navtiming@64e3f63]: (no justification provided) [23:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:26] !log krinkle@deploy1001 Finished deploy [performance/navtiming@64e3f63]: (no justification provided) (duration: 00m 05s) [23:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:41] (03CR) 10Hashar: "That is to address T149924 since we are starting with a fresh host :]" [puppet] - 10https://gerrit.wikimedia.org/r/480881 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [23:06:02] hashar: ok, yea, i don't love the name but i get the reason :) was already looking [23:06:21] sorry I remembered about that old task (stop using /srv) a bit too late [23:07:09] it bugs me a bit when /srv/docroot is not the document root :) [23:07:48] yeah that is because integration/docroot.git is shared between doc.wm.o and integration.wm.o [23:08:03] https://puppet-compiler.wmflabs.org/compiler1002/14017/ [23:08:07] failed to compile bah [23:08:19] 
Unable to find facts for host doc1001.eqiad.wmnet hehe [23:08:44] hashar: oh yea.. that. because of that i asked earlier how to sync the new facts to the compiler [23:08:48] that happens for new hosts [23:09:11] that's also why i did the "check experimental" earlier [23:09:13] mutante: https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs/Documentation :) [23:09:35] aaah :) [23:09:47] gotta update it [23:12:16] mutante: updated :) [23:12:39] oh you did, i was still at "can't resolve host name", cool [23:14:30] reads the bash script. great hashar, i wanted to know how this works for next time [23:14:43] yeah that is useful to know :) [23:18:16] hashar: are you admin in puppet3-diffs ? [23:18:54] hmm [23:19:04] oh that is the old project [23:19:05] i am member of puppet-diffs but not puppet3-diffs [23:19:11] that got changed to puppet-diffs [23:19:19] did you update the old project or am i in the old project, heh [23:19:26] I think herron did the migration of the puppet compilers \o/ [23:19:49] at least I am a member of puppet-diffs [23:19:57] the instances here are called "compiler1001" and "compiler1002" [23:20:09] but i can't ssh to compiler02.puppet3-diffs [23:20:15] not a member in that one [23:20:38] yeah it is gone [23:20:44] probably means the script needs to be updated [23:21:06] but then how did the script work for you :) [23:21:24] I can't use it since I don't have access to the prod puppet master [23:21:33] but if you set PUPPET_COMPILER=compiler1001.puppet-diffs.eqiad.wmflabs [23:21:39] that should point to that new host [23:22:04] oh [23:22:05] https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs/Documentation [23:22:10] i thought you meant by "updated" that you ran it [23:22:53] ok :)! 
[23:24:27] !log syncing facts from puppetmaster1001 to compiler1001/compiler1002 [23:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:54] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/14018/console [23:26:32] hashar: it has not finished.. it just claimed it started and that's it so far [23:27:07] bah [23:27:18] that is a rabbit hole :) [23:28:16] hashar: i believe the first compiler host is done now [23:28:45] also i got a certificate warning about no subjectAltName for cert for puppetdb..but that told me it continued [23:29:13] now connected to compiler1002 [23:31:52] hashar: script finished now [23:32:17] due to my ssh config and connecting to both labs and prod i had to enter a key passphrase about 12 times, but besides that it worked [23:33:36] hashar: it worked, not 404 anymore https://puppet-compiler.wmflabs.org/compiler1002/14020/doc1001.eqiad.wmnet/ [23:33:59] :) nice, it has bugged me multiple times before i did not have this setup [23:34:00] ah good :) [23:34:17] so basically /srv => /srv/docroot [23:34:31] this way the base dir is not root:root owned :) [23:34:44] that means moving files around though [23:35:25] I will further polish up the CI jobs tomorrow :D [23:35:51] for now, it is almost 1am so time to sleep a bit [23:37:00] ok, yea, have a good night hashar [23:39:18] (03CR) 10Dzahn: [C: 03+2] git: git::clone requires the git package [puppet] - 10https://gerrit.wikimedia.org/r/480880 (owner: 10Hashar) [23:39:56] (03PS1) 10Kaldari: Adding NOINDEX template to $wgPageTriageNoIndexTemplates for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480884 (https://phabricator.wikimedia.org/T211043) [23:41:49] (03PS1) 10Volans: Upstream release v0.0.10 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/480885