[00:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T0000).
[00:00:04] eddiegp: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:37] o/
[00:01:06] paladox: Heh, `bazel build release` expects you to have `zip` installed.
[00:01:14] The docs don't say that! They just say Java 8 + node!
[00:01:15] :)
[00:01:28] (spun up a fresh VM, that's how I noticed)
[00:06:28] I can SWAT
[00:06:44] no_justification lol.
[00:07:33] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: EddieGP)
[00:08:13] (CR) jerkins-bot: [V: -1] alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: EddieGP)
[00:09:22] Operations, Puppet, User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3672181 (Dzahn) Hi, is puppetcompiler1001 going to stay in site.pp permanently or was it a temporary thing?
[00:09:31] well neat
[00:09:31] Shut up jerkins.
This seems to be the problem just reported in -releng
[00:09:37] yeah
[00:09:51] https://phabricator.wikimedia.org/T182266#3818591
[00:12:29] (PS1) Dzahn: mirrors,poolcounter,tendril,tor,labpuppetmaster,openldap::lab: rm ganglia [puppet] - https://gerrit.wikimedia.org/r/395896 (https://phabricator.wikimedia.org/T177225)
[00:13:33] (CR) Dzahn: [C: 2] mirrors,poolcounter,tendril,tor,labpuppetmaster,openldap::lab: rm ganglia [puppet] - https://gerrit.wikimedia.org/r/395896 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[00:15:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[00:16:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0
[00:18:38] paladox: Confirmed, jars in archiva are wrong, ones in the git repo are correct
[00:18:40] Fixing!
[00:18:48] thanks :)
[00:20:31] Operations, Analytics, ChangeProp, EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3818603 (Pchelolo) For the reference, next time we migrate recursive jobs we need to switch off Redis queue production before switching on Kafka consumption....
[00:20:55] (PS1) Dzahn: ganglia/labpuppetmaster: fix typo in hieradata path [puppet] - https://gerrit.wikimedia.org/r/395897
[00:21:42] (CR) Dzahn: [C: 2] ganglia/labpuppetmaster: fix typo in hieradata path [puppet] - https://gerrit.wikimedia.org/r/395897 (owner: Dzahn)
[00:25:38] thcipriani: Are you done with SWAT yet? Can I SWAT another patch?
[00:26:08] Swat seems to be on hold due to the CI :/
[00:26:13] RoanKattouw: hadn't started actually, there are CI problems that I'm fiddling with
[00:26:28] Oh you're trying to unbreak CI still, I see
[00:26:32] Then I'll just edit the wiki page
[00:26:58] "Overwriting released artifacts in repository 'releases' is not allowed."
[00:27:00] Stupid archiva
[00:27:04] oh
[00:27:05] I deleted the original ones on purpose
[00:27:07] * no_justification stabs
[00:27:13] heh
[00:27:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5
[00:27:56] (PS1) Dzahn: puppetcompiler1001: add role(test) [puppet] - https://gerrit.wikimedia.org/r/395900 (https://phabricator.wikimedia.org/T177843)
[00:28:42] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5
[00:29:21] https://archiva.wikimedia.org/#advancedsearch/~~2.14.6~~~
[00:30:00] Ahhh, it deletes the artifact but not the manifest entry
[00:30:00] do those have new sha?
[00:30:05] ah
[00:30:13] Well that's annoying as hell
[00:30:27] heh
[00:30:32] (PS2) Dzahn: puppetcompiler1001: add role(test) [puppet] - https://gerrit.wikimedia.org/r/395900 (https://phabricator.wikimedia.org/T177843)
[00:31:05] (CR) Dzahn: [C: 2] puppetcompiler1001: add role(test) [puppet] - https://gerrit.wikimedia.org/r/395900 (https://phabricator.wikimedia.org/T177843) (owner: Dzahn)
[00:31:43] no_justification i see you just did something :)
[00:31:43] 0e5539a9453a583a36034aa59a906674dc814fcb
[00:31:43] 220,675 100% 42.09MB/s 0:00:00 (xfr#1, to-chk=0/1)
[00:32:18] Wait, what did I do?
[00:32:24] (only did it for one of them).
[00:32:33] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:32:43] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:32:47] (CR) Thcipriani: alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: EddieGP)
[00:32:51] (PS3) Thcipriani: alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: EddieGP)
[00:32:52] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:32:55] Oh wait, they went through
[00:33:00] Maybe something was cached?
[00:33:08] stat1005 = not me
[00:33:12] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:33:12] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:33:14] only one though
[00:33:23] the rest are still showing that error
[00:33:23] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:33:32] (CR) jerkins-bot: [V: -1] alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: EddieGP)
[00:33:47] Try again
[00:33:52] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[00:34:12] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[00:34:12] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[00:34:23] RECOVERY - Disk space on stat1005 is OK: DISK OK
[00:34:33] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[00:34:43] RECOVERY - DPKG on stat1005 is OK: All packages OK
[00:34:48] ok
[00:35:27] (CR) BBlack: [C: 1] admins: new groups for reading varnish/mw logs, debugging
[puppet] - https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) (owner: Dzahn)
[00:35:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5
[00:36:00] another one went through now
[00:36:01] 2ee37bec0997442a406805e463f92632d0b6b89b
[00:36:01] 5,396 100% 5.15MB/s 0:00:00 (xfr#1, to-chk=0/1)
[00:36:01] 5,396 100% 5.15MB/s 0:00:00 (xfr#1, to-chk=0/1)
[00:36:08] Mine's saying skipping non-regular file
[00:36:10] seems to be slow with them going through though
[00:36:10] w/e that means
[00:36:13] But hey, progress!
[00:36:17] (CR) Hoo man: [C: 1] admins: new groups for reading varnish/mw logs, debugging [puppet] - https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) (owner: Dzahn)
[00:37:05] yeh :)
[00:37:20] (PS6) Dzahn: admins: new groups for reading varnish/mw logs, debugging [puppet] - https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317)
[00:37:42] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5
[00:38:36] (CR) Dzahn: [C: 2] admins: new groups for reading varnish/mw logs, debugging [puppet] - https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) (owner: Dzahn)
[00:39:00] 4 more to go :)
[00:40:30] seems no more have gone through
[00:42:03] delete-project.jar and download-commands.jar and its-phabricator.jar and reviewnotes.jar left :)
[00:45:28] Ok, I think that's all but its-phabricator
[00:46:08] 2 more
[00:46:09] yay
[00:46:09] 3644ce3621d696d514031ba229f5e21d859f3ab5
[00:46:10] 24,758
100% 23.61MB/s 0:00:00 (xfr#1, to-chk=1/2)
[00:46:10] 661df989630b7cdc19b08724c9688220d8d02cc5
[00:46:10] 49,004 100% 46.73MB/s 0:00:00 (xfr#2, to-chk=0/2)
[00:46:31] My rsync doesn't like this repo at all, heh
[00:46:36] It won't let me git fat pull
[00:46:43] "skipping non-regular file"
[00:46:46] pshaw
[00:47:05] lol
[00:47:29] Ok, its-phabricator reuploaded too
[00:47:33] Only took 3 tries this time!
[00:47:35] Pfft
[00:47:53] thanks :)
[00:50:31] I think that should be all of them?
[00:50:34] * no_justification hopes
[00:50:42] I wanna go play Mario already :p
[00:51:05] ok it's rsync now :)
[00:51:10] lol
[00:51:15] (PS1) Dzahn: admins: add hoo to mw-testers,varnish/maintenance-log-readers [puppet] - https://gerrit.wikimedia.org/r/395903 (https://phabricator.wikimedia.org/T179317)
[00:51:21] shoots green turtles
[00:53:28] (CR) Dzahn: [C: 2] admins: add hoo to mw-testers,varnish/maintenance-log-readers [puppet] - https://gerrit.wikimedia.org/r/395903 (https://phabricator.wikimedia.org/T179317) (owner: Dzahn)
[00:54:25] scaping is failing with
[00:54:26] rsync: link_stat "/git-fat/1e22db8ef371601755b6279256f4400a8735400f" (in archiva) failed: No such file or directory (2)
[00:54:31] no_justification
[00:54:47] Blargggghhhhh
[00:54:51] I hate this
[00:54:53] I give up for today
[00:55:33] why do i have an issue on terbium/wasat but not on mwdebug and cache.. i dont see a difference..boo
[00:57:29] it's plugins/lfs.jar
[01:00:04] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T0100).
[01:00:05] No GERRIT patches in the queue for this window AFAICS.
[01:02:20] thcipriani: Should I reschedule the config patch?
[01:03:36] eddiegp: yeah, sorry, still poking at CI :(
[01:03:49] Okay :(
[01:03:53] Mine fixes a bad bug so I'm gonna wait a bit longer
[01:04:07] If CI gets fixed later today I'd like to still deploy my patch if possible
[01:05:09] Good luck! Normally I'd wait too, but it's 2am here. I gonna call it a day.
[01:06:09] eddiegp: there's 22 hours of it left
[01:07:04] lol
[01:09:38] Reedy: My personal midnight is more like 4am :P
[01:10:53] is there anything different with the "maintenance-log-readers" group vs. the other 2 groups here? https://gerrit.wikimedia.org/r/#/c/394102/6/modules/admin/data/data.yaml
[01:11:08] i cant see but for some reason puppet has an issue with that and not the others
[01:11:33] already wondered if the name is just too long ... :p
[01:14:02] no_justification for when you have time later, i see lfs missing from https://archiva.wikimedia.org/#advancedsearch/~~2.14.6~~~
[01:32:48] (PS1) Dzahn: admins: fix syntax error in sudo privs for maint-log-readers [puppet] - https://gerrit.wikimedia.org/r/395908 (https://phabricator.wikimedia.org/T179317)
[01:33:25] paladox: Uploaded
[01:33:29] * no_justification goes back to pizza and Mario
[01:34:22] https://en.wikipedia.org/wiki/The_Great_Giana_Sisters
[01:35:11] (CR) Dzahn: [C: 2] admins: fix syntax error in sudo privs for maint-log-readers [puppet] - https://gerrit.wikimedia.org/r/395908 (https://phabricator.wikimedia.org/T179317) (owner: Dzahn)
[01:41:49] Operations, Ops-Access-Requests, Patch-For-Review, Performance-Team (Radar): Varnish and Apache debug tools and logs for hoo - https://phabricator.wikimedia.org/T179317#3818710 (Dzahn) Alright, this should be resolved now. As requested and approved, using 3 new groups as such: on a random varni...
[01:43:03] Operations, Ops-Access-Requests, Patch-For-Review, Performance-Team (Radar): Varnish and Apache debug tools and logs for hoo - https://phabricator.wikimedia.org/T179317#3818711 (Dzahn) Open>Resolved Let us know if something isn't working.
[01:44:51] no_justification thanks, it works now :)
[01:44:58] installed and started correctly
[01:45:28] except from it seems to replace the host name for sshd with ipv6. running puppet will bring it back to the work version.
[01:51:07] (CR) Paladox: [C: 1] "Works now :)" [software/gerrit] - https://gerrit.wikimedia.org/r/395820 (owner: Chad)
[02:29:31] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.10) (duration: 08m 57s)
[02:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:44] Operations, Analytics, ChangeProp, EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3818753 (Pchelolo) The backlog was cleared now, all seems in good shape.
[02:32:34] So did CI get fixed?
[02:33:06] RoanKattouw: at least temporarily, it seems, but thciprian.i is working on the more permanent one
[02:33:11] over in -releng
[02:33:46] Thanks, moving there
[03:00:14] !log catrope@tin Synchronized php-1.31.0-wmf.11/resources/src/mediawiki.rcfilters/: T182268 (duration: 00m 57s)
[03:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:26] T182268: [wmf.11-regression] Saved filters titles are not displayed in Active Filter area - https://phabricator.wikimedia.org/T182268
[03:08:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[03:08:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0
[03:10:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[03:10:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0
[03:16:27] (PS1) Milimetric: Add and link readme pages for analytics datasets [puppet] - https://gerrit.wikimedia.org/r/395917 (https://phabricator.wikimedia.org/T167033)
[03:24:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 710.15 seconds
[03:29:41] Operations, Analytics, ChangeProp, EventBus, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274#3818793 (Pchelolo) p:Triage>Low
[03:30:03] Operations, Analytics, ChangeProp, EventBus, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274#3818807 (Pchelolo)
[03:35:39] (CR) Thcipriani: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner:
EddieGP)
[03:42:23] PROBLEM - Check Varnish expiry mailbox lag on cp4025 is CRITICAL: CRITICAL: expiry mailbox lag is 2113901
[03:48:22] (Draft2) Zppix: Rm all past throttle overrides in throttle.php [mediawiki-config] - https://gerrit.wikimedia.org/r/395918
[04:02:34] (PS8) TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045)
[04:02:59] (PS20) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[04:05:13] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 196.04 seconds
[04:26:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[04:26:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[04:30:13] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[04:30:22] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[04:32:23] RECOVERY - Check Varnish expiry mailbox lag on cp4025 is OK: OK: expiry mailbox lag is 0
[04:45:43] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:45:52] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:46:03] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:46:12] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:46:12] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:46:23] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:46:33] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:49:42] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:55:43] RECOVERY - DPKG on stat1005 is OK: All packages OK
[04:55:52] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[04:56:03] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures
[04:56:12] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[04:56:12] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[04:56:23] RECOVERY - Disk space on stat1005 is OK: DISK OK
[04:56:33] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[05:17:41] (PS1) EBernhardson: Enable more accurate smaps based rss checking [puppet/cdh] - https://gerrit.wikimedia.org/r/395923 (https://phabricator.wikimedia.org/T182276)
[05:19:43] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Thu 2017-12-07 05:19:38 UTC.
[05:19:46] (CR) EBernhardson: "See https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml for the description. This was addded to" [puppet/cdh] - https://gerrit.wikimedia.org/r/395923 (https://phabricator.wikimedia.org/T182276) (owner: EBernhardson)
[05:56:40] Operations, Release-Engineering-Team (Watching / External), User-Joe: Create jenkins job for creating deployment artifacts for `docker-pkg-deploy` - https://phabricator.wikimedia.org/T179562#3818969 (greg)
[06:14:08] Operations, ops-codfw, DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T181779#3818983 (Marostegui) Open>Resolved All good! Thank you Papaul!
``` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Nam...
[06:18:40] Operations, ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3818997 (RobH) Open>stalled a:RobH Went ahead and attempted the troubleshooting steps, but no joy: > Hi Rob, > > Has this device been setup before or is this the first time it's powere...
[06:19:02] PROBLEM - proxysql processes on wasat is CRITICAL: PROCS CRITICAL: 3 processes with command name proxysql
[06:19:48] Operations, ops-eqsin, netops: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3819002 (RobH) a:ayounsi
[06:20:38] Operations, ops-eqsin, netops: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794067 (RobH) As of today, all the networking equipment is in racktables and remotely accessible via mr1-eqsin. Arzhel is working on an issue with one of the two stacking cables...
[06:26:02] RECOVERY - proxysql processes on wasat is OK: PROCS OK: 1 process with command name proxysql
[06:27:43] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.009 second response time
[06:28:42] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.014 second response time
[06:32:03] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py]
[06:35:44] (PS1) Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/395927 (https://phabricator.wikimedia.org/T174569)
[06:37:38] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/395927 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:39:08] (Merged) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/395927 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:39:18] (CR) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/395927 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:40:20] !log Deploy schema change on db1074 (s2) - T174569
[06:40:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1074 - T174569 (duration: 00m 48s)
[06:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:31] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:39] (PS1) Marostegui: db-eqiad.php: Pool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/395928 (https://phabricator.wikimedia.org/T178359)
[06:44:43] (CR) Marostegui: [C: -2] "Wait for a few tables to be reimported" [mediawiki-config] - https://gerrit.wikimedia.org/r/395928 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:45:48] !log Stop replication on db1099:3311 to reimport: change_tag, tag_summary, user and watchlist tables and recompress again - T178359
[06:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:58] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:54:27] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395929 (https://phabricator.wikimedia.org/T178359)
[06:57:03] RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:07] (PS1) Marostegui: db1098.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/395930 (https://phabricator.wikimedia.org/T178359)
[07:07:40] (CR) Marostegui: [C: 2] db1098.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/395930 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:15:43] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Pool db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395929 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:17:15] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395929 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:17:26] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395929 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:18:55] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Slowly pool db1098:3316 db1098:3317 - T178359 (duration: 00m 48s)
[07:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:06] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[07:19:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly pool db1098:3316 db1098:3317 - T178359 (duration: 00m 47s)
[07:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:34] PROBLEM - MegaRAID on db1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[07:34:35] ACKNOWLEDGEMENT - MegaRAID on db1068 is CRITICAL: CRITICAL: 1
failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T182288
[07:34:38] Operations, ops-eqiad: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T182288#3819156 (ops-monitoring-bot)
[07:34:55] Lovely, s4 master
[07:35:38] Operations, ops-eqiad, DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T182288#3819160 (Marostegui) p:Triage>High @Cmjohnson can we get this replaced as soon as we can? This is s4 master.
[07:38:36] (PS2) Marostegui: db-eqiad.php: Pool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/395928 (https://phabricator.wikimedia.org/T178359)
[08:05:05] Operations, ORES, Patch-For-Review, Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3819169 (akosiaris) >>! In T169246#3817883, @awight wrote: > Point well taken. What if we temporarily depool some of the servers f...
[08:23:25] (PS1) Elukey: Relese version 0.5 [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/395948
[08:23:47] (CR) Elukey: [V: 2 C: 2] Relese version 0.5 [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/395948 (owner: Elukey)
[08:26:09] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395949
[08:26:26] !log upload prometheus-druid-exporter 0.5-1 to jessie/stretch-wikimedia
[08:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:26] !log install prometheus-druid-exporter 0.5 on druid*
[08:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:50] this adds support for realtime ingestion metrics --^
[08:29:03] Operations, Prod-Kubernetes, monitoring, Kubernetes, and 3 others: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395#3819227 (akosiaris)
[08:29:05] Operations, Prod-Kubernetes, monitoring, Kubernetes, and 3 others: Gaps in kubelet-reported Prometheus metrics - https://phabricator.wikimedia.org/T181489#3819225 (akosiaris) Open>Resolved Wooho!!! Fixed. https://grafana.wikimedia.org/dashboard/db/kubernetes-pods?orgId=1&from=now-12h&to=n...
[08:29:58] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395949 (owner: Marostegui)
[08:31:18] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395949 (owner: Marostegui)
[08:31:30] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395949 (owner: Marostegui)
[08:32:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1098:3316 db1098:3317 - T178359 (duration: 00m 48s)
[08:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:37] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:36:41] (PS1) Elukey: role::druid::analytics::worker: configure prometheus metrics for mm [puppet] - https://gerrit.wikimedia.org/r/395950
[08:38:25] (CR) Elukey: [C: 2] role::druid::analytics::worker: configure prometheus metrics for mm [puppet] - https://gerrit.wikimedia.org/r/395950 (owner: Elukey)
[08:46:16] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395951
[08:48:56] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395951 (owner: Marostegui)
[08:50:11] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395951 (owner: Marostegui)
[08:50:21] (CR)
jenkins-bot: db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - https://gerrit.wikimedia.org/r/395951 (owner: Marostegui)
[08:51:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1098:3316 db1098:3317 - T178359 (duration: 00m 48s)
[08:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:37] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:55:05] (PS1) Marostegui: db1074.yaml: Update socket location [puppet] - https://gerrit.wikimedia.org/r/395952
[09:00:04] gehel: My dear minions, it's time we take the moon! Just kidding. Time for Logstash upgrade deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T0900).
[09:00:04] No GERRIT patches in the queue for this window AFAICS.
[09:00:28] !log upgrading ELK stack on logstash100* - some log messages might be lost during the upgrade - T178412
[09:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:39] T178412: Upgrade logstash cluster to elastic 5.5.x - https://phabricator.wikimedia.org/T178412
[09:03:17] marostegui: I'd like deploy the ReadingLists extension to production later today (T181107), and jynus said to use the x1 cluster
[09:03:17] T181107: Deploy Reading Lists Service to production - https://phabricator.wikimedia.org/T181107
[09:03:28] that would mean the wikishared DB, right?
[09:04:02] I didn't find any docs on it, but all the extensions other than Flow seem to use that
[09:05:07] !log gehel@tin Started deploy [logstash/plugins@b13d2fa]: (no justification provided)
[09:05:10] !log gehel@tin Finished deploy [logstash/plugins@b13d2fa]: (no justification provided) (duration: 00m 02s)
[09:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:52] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=logstash1007.eqiad.wmnet
[09:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:44] <_joe_> tgr: x1 is the "extra 1" shard, so it's not tied to a specific wiki, yes
[09:10:03] <_joe_> if that was your question
[09:10:29] <_joe_> tgr: so the reading list service is just a new restbase module + a mw extension, amirite?
[09:10:44] _joe_: I'm just not sure what DBs are available there
[09:11:00] <_joe_> tgr: I can paste you the list in private somewhere
[09:11:22] yes, although the RB module will go into production later so as to not interfere with the Cassandra work
[09:11:53] well, I don't care about the list if wikishared is the one I need to use :)
[09:12:03] just wasn't sure about that
[09:12:10] <_joe_> tgr: yeah I was reading about this now.
I'm very happy you didn't create another service where it wasn't needed [09:12:40] <_joe_> sometimes I see services that only work through restbase, do a small lambda on data they ask to rb [09:13:40] <_joe_> so you get the quite horrible antipattern of doing varnish -> rb -> service -> rb -> parsoid -> mediawiki [09:13:49] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=logstash1007.eqiad.wmnet [09:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:16] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=logstash1008.eqiad.wmnet [09:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:26] 10Operations, 10monitoring: Upgrade grafana to 4.6.2 - https://phabricator.wikimedia.org/T182294#3819283 (10akosiaris) [09:18:51] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=logstash1008.eqiad.wmnet [09:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:05] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=logstash1009.eqiad.wmnet [09:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:21] (03CR) 10ArielGlenn: Fix killing dumpers in Wikidata entity dumpers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393923 (owner: 10Hoo man) [09:22:47] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=logstash1009.eqiad.wmnet [09:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:48] tgr: yeah as _joe_ it is a shared one [09:24:18] tgr: What is the impact of deploying it? 
(I am asking because tomorrow is a public holiday here, so I won't be around, and neither will jynus) [09:24:41] should be zero, it won't get any load until next year [09:25:38] cool then :) [09:25:43] but I can move the deploy window if you prefer [09:25:45] Are you creating the tables with: IF NOT EXISTS? [09:25:56] tgr: No, no need to, if it won't get any traffic, that is fine [09:26:03] should I? [09:26:23] I was just planning to run mwscript sql.php --wiki=mediawikiwiki --cluster external1 --wikidb wikishared /srv/mediawiki-staging/php-1.31.0-wmf.11/extensions/ReadingLists/sql/readinglists.sql [09:26:34] that's https://phabricator.wikimedia.org/diffusion/ERLS/browse/master/sql/readinglists.sql;0efea27517f8936bb4a572bbde04c4048f5576b9 [09:27:03] tgr: if it is only creating tables on x1, that should not break anything (I mean not using IF NOT EXISTS) [09:27:22] Worst case scenario it will break replication on dbstore1002 and servers like that, not a big deal anyways [09:27:39] It usually happens when tables are created in core wikis + x1 [09:27:50] but this is not your case, as you are only creating them on x1, so it should be fine :) [09:28:16] cool, thx [09:28:45] !log Upgrade MySQL and kernel on db1074 [09:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:04] (03CR) 10Marostegui: [C: 032] db1074.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/395952 (owner: 10Marostegui) [09:32:15] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395955 [09:32:43] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3819305 (10akosiaris) memtester had repeatedly failed to uncover anything. After some Google reading I see that some people have had similar issues using Linux test project (https...
[09:33:43] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395955 (owner: 10Marostegui) [09:35:09] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395955 (owner: 10Marostegui) [09:36:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1098:3316 db1098:3317 - T178359 (duration: 00m 52s) [09:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:24] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [09:36:53] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1098:331{6,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395955 (owner: 10Marostegui) [09:42:43] PROBLEM - DPKG on ganeti1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:46:03] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:46:09] ignore ^ [09:46:11] it's me [09:46:18] I maybe managing to trigger the damn issue [09:46:30] (03PS1) 10Gehel: kibana: increase max payload size [puppet] - 10https://gerrit.wikimedia.org/r/395957 (https://phabricator.wikimedia.org/T178412) [09:46:51] !log silence ganeti1006 on icinga T181121 [09:46:58] (03CR) 10jerkins-bot: [V: 04-1] kibana: increase max payload size [puppet] - 10https://gerrit.wikimedia.org/r/395957 (https://phabricator.wikimedia.org/T178412) (owner: 10Gehel) [09:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:03] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121 [09:47:15] (03PS2) 10Gehel: kibana: increase max payload size [puppet] - 10https://gerrit.wikimedia.org/r/395957 (https://phabricator.wikimedia.org/T178412) [09:47:58] (03CR) 10DCausse: [C: 031] kibana: increase max payload size [puppet] - 
10https://gerrit.wikimedia.org/r/395957 (https://phabricator.wikimedia.org/T178412) (owner: 10Gehel) [09:48:00] !log CI: removed Wikidata from configuration, replaced by Wikibase. wmf/* and REL branches are going to be broken though | https://gerrit.wikimedia.org/r/395704 | T181838 [09:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:10] T181838: Mark extension-Wikidata & wikidata-build-resources on Gerrit as ARCHIVED - https://phabricator.wikimedia.org/T181838 [09:48:11] (03CR) 10Gehel: [C: 032] kibana: increase max payload size [puppet] - 10https://gerrit.wikimedia.org/r/395957 (https://phabricator.wikimedia.org/T178412) (owner: 10Gehel) [09:56:52] (03PS1) 10Gehel: kibana: increase max payload size [puppet] - 10https://gerrit.wikimedia.org/r/395958 (https://phabricator.wikimedia.org/T178412) [09:57:22] (03CR) 10DCausse: [C: 031] kibana: increase max payload size [puppet] - 10https://gerrit.wikimedia.org/r/395958 (https://phabricator.wikimedia.org/T178412) (owner: 10Gehel) [09:57:37] (03CR) 10Gehel: [C: 032] kibana: increase max payload size [puppet] - 10https://gerrit.wikimedia.org/r/395958 (https://phabricator.wikimedia.org/T178412) (owner: 10Gehel) [10:01:19] !log upgrade of ELK stack on logstash100* completed - Kibana was unavailable for longer than expected - T178412 [10:01:25] (03PS1) 10Hashar: contint: fix illegal title type Integer -> String [puppet] - 10https://gerrit.wikimedia.org/r/395961 [10:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:30] T178412: Upgrade logstash cluster to elastic 5.5.x - https://phabricator.wikimedia.org/T178412 [10:01:56] (03CR) 10jerkins-bot: [V: 04-1] contint: fix illegal title type Integer -> String [puppet] - 10https://gerrit.wikimedia.org/r/395961 (owner: 10Hashar) [10:03:31] (03PS2) 10Hashar: contint: fix illegal title type Integer -> String [puppet] - 10https://gerrit.wikimedia.org/r/395961 [10:05:10] (03CR) 10Hashar: "That is while using 
puppet 3.8. I guess it is an error related to the introduction of the future parser." [puppet] - 10https://gerrit.wikimedia.org/r/395961 (owner: 10Hashar) [10:07:48] (03PS3) 10Muehlenhoff: Switch remaining eqiad video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395494 [10:10:27] (03PS1) 10Marostegui: db-eqiad.php: Fully pool db1098:331{6,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395963 (https://phabricator.wikimedia.org/T178359) [10:12:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully pool db1098:331{6,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395963 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:12:25] !log reboot analytics1003 for kernel+jvm updates - T179943 [10:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:35] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [10:13:27] (03Merged) 10jenkins-bot: db-eqiad.php: Fully pool db1098:331{6,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395963 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:13:41] (03CR) 10jenkins-bot: db-eqiad.php: Fully pool db1098:331{6,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395963 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:14:05] (03CR) 10Muehlenhoff: [C: 032] Switch remaining eqiad video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395494 (owner: 10Muehlenhoff) [10:14:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully pool db1098:3316 db1098:3317 - T178359 (duration: 00m 51s) [10:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:46] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [10:17:03] PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:17:13] PROBLEM - HHVM rendering 
on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:17:43] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:50] (03PS1) 10Addshore: Move Wikibase dispatchingLockManager to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395966 [10:24:17] <_joe_> checking mw1283 [10:26:59] <_joe_> INFO: task hhvm:23247 blocked for more than 120 seconds. [10:27:03] <_joe_> this is kinda new [10:27:27] <_joe_> that server has big troubles [10:28:02] <_joe_> !log depooling mw1283 for further investigation [10:28:05] (03PS1) 10Addshore: Create a LockManager for WikidataDispatch with short TTL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) [10:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:33] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.099 second response time [10:31:54] RECOVERY - Nginx local proxy to apache on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.050 second response time [10:32:13] RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 74905 bytes in 5.160 second response time [10:32:38] (03PS2) 10Addshore: Create a LockManager for WikidataDispatch with short TTL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) [10:34:04] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Cleanup multiple definitions of logstash endpoint in puppet / hiera - https://phabricator.wikimedia.org/T182304#3819484 (10Gehel) [10:34:08] (03PS1) 10Addshore: Log wikibase dispatchChanges script for testwikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/395968 [10:34:31] (03PS2) 10Addshore: Log wikibase dispatchChanges script for testwikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/395968 [10:35:07] (03CR) 10jerkins-bot: [V: 04-1] Log 
wikibase dispatchChanges script for testwikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/395968 (owner: 10Addshore) [10:35:35] (03PS1) 10Addshore: Use new wikibase dispatch lock manager on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395969 (https://phabricator.wikimedia.org/T178652) [10:37:39] (03PS1) 10Gehel: elasticsearch: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/395970 (https://phabricator.wikimedia.org/T182304) [10:38:39] (03PS3) 10Addshore: Log wikibase dispatchChanges script for testwikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/395968 [10:42:14] (03CR) 10Gehel: "Once more, there is something I don't quite understand about hiera. Puppet compiler does not like this change: https://puppet-compiler.wmf" [puppet] - 10https://gerrit.wikimedia.org/r/395970 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [10:42:24] PROBLEM - Hive Metastore on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [10:42:44] PROBLEM - Hive Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 [10:43:13] not sure what's hapening, the host went down by itself --^ [10:43:14] checking [10:44:57] I can't use console com2, it doesn't allow me to connect [10:45:40] moritzm: did you execute anything special on an1003? [10:45:43] (03CR) 10Thiemo Mättig (WMDE): contint: fix illegal title type Integer -> String (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/395961 (owner: 10Hashar) [10:47:31] so now I can only see System is booting up. See pam_nologin(8) [10:47:38] and can't use the serial console [10:48:03] probably a hard reboot is the only next step? [10:50:32] !log powercycle analytics1003 - no serial console, ssh stuck in System is booting up. 
See pam_nologin(8) [10:50:41] (03PS1) 10Marostegui: db-eqiad.php: Repool db1074 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395973 [10:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:58] ok host is booting [10:51:07] elukey: no, didn't do anything, but I was still logged into the mgmt due to the earlier reboot problem, logged off now [10:51:39] moritzm: ah snap I thought that the console would have told me an error message rather than not allowing me in [10:51:53] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:53:13] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [10:54:40] (03CR) 10Hashar: [V: 031] "And Shinken is all happy:" [puppet] - 10https://gerrit.wikimedia.org/r/395961 (owner: 10Hashar) [10:54:42] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1074 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395973 (owner: 10Marostegui) [10:55:24] PROBLEM - Hive Metastore on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [10:55:44] PROBLEM - Hive Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 [10:56:06] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1074 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395973 (owner: 10Marostegui) [10:56:23] so mysql is up [10:56:26] but hive is not [10:56:34] working on it [10:56:59] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1074 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395973 (owner: 10Marostegui) [10:57:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1074 with low weight (duration: 00m 48s) [10:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:35] (03CR) 10Hashar: [V: 031] contint: fix illegal 
title type Integer -> String (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/395961 (owner: 10Hashar) [10:57:40] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10hardware-requests: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3819562 (10Marostegui) [11:05:35] !log mobrovac@tin Started deploy [restbase/deploy@097ba7d]: Add CORS headers to erroneous responses as well - T182103 [11:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:46] T182103: 404 responses do not specify CORS headers - https://phabricator.wikimedia.org/T182103 [11:09:00] (03CR) 10Paladox: "I've already done it here https://gerrit.wikimedia.org/r/#/c/394096/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/395961 (owner: 10Hashar) [11:10:59] !log mobrovac@tin Finished deploy [restbase/deploy@097ba7d]: Add CORS headers to erroneous responses as well - T182103 (duration: 05m 24s) [11:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:11] T182103: 404 responses do not specify CORS headers - https://phabricator.wikimedia.org/T182103 [11:16:19] (03PS1) 10Marostegui: db-eqiad.php: Increase API traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395976 [11:18:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase API traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395976 (owner: 10Marostegui) [11:20:10] (03Merged) 10jenkins-bot: db-eqiad.php: Increase API traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395976 (owner: 10Marostegui) [11:20:20] (03CR) 10jenkins-bot: db-eqiad.php: Increase API traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395976 (owner: 10Marostegui) [11:21:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase API traffic for db1074 (duration: 00m 48s) [11:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:55] (03Abandoned) 10Hashar: 
contint: fix illegal title type Integer -> String [puppet] - 10https://gerrit.wikimedia.org/r/395961 (owner: 10Hashar) [11:22:22] (03CR) 10Hashar: [V: 031 C: 031] "cherry picked on the CI puppet master and that fixed puppet there:" [puppet] - 10https://gerrit.wikimedia.org/r/394096 (owner: 10Paladox) [11:23:17] (03PS1) 10ArielGlenn: move production of lists of last good dumps from snapshot to web server [puppet] - 10https://gerrit.wikimedia.org/r/395977 (https://phabricator.wikimedia.org/T182303) [11:35:12] !log Compress s8 on db1099 - T178359 [11:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:22] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [11:38:21] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395979 [11:40:31] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395979 (owner: 10Marostegui) [11:41:00] !log reimaging mw2152 to stretch [11:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:52] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395979 (owner: 10Marostegui) [11:42:07] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395979 (owner: 10Marostegui) [11:42:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase API traffic for db1074 (duration: 00m 48s) [11:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:08] (03PS1) 10Elukey: profile::hive::server: set hive port [puppet] - 10https://gerrit.wikimedia.org/r/395980 [11:55:06] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395985 [11:57:44] RECOVERY - Hive Server on analytics1003 is OK: 
PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 [12:00:26] PROBLEM - HHVM rendering on mw2167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:01:17] RECOVERY - HHVM rendering on mw2167 is OK: HTTP OK: HTTP/1.1 200 OK - 74933 bytes in 0.296 second response time [12:05:16] RECOVERY - Hive Metastore on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [12:05:36] (03Abandoned) 10Zfilipin: Update RuboCop Ruby gem [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) (owner: 10Zfilipin) [12:06:35] hive back to life [12:12:26] marostegui: I'm going to sneak a couple of changes into mediawiki-config before SWAT if they won't get in your way! [12:12:36] 10Operations, 10Cloud-Services, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 4 others: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3819760 (10hashar) 05Resolved>03Open [12:12:44] I realise you last touched it 30 mins ago, but worth a check....
[12:12:44] 10Operations, 10Cloud-Services, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 4 others: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3645705 (10hashar) 05Open>03Resolved [12:12:49] (03PS1) 10Elukey: role::analytics_cluster::coordinator: remove prometheus jmx config from hive [puppet] - 10https://gerrit.wikimedia.org/r/395989 (https://phabricator.wikimedia.org/T177458) [12:14:06] (03PS2) 10Elukey: role::analytics_cluster::coordinator: remove prometheus jmx config from hive [puppet] - 10https://gerrit.wikimedia.org/r/395989 (https://phabricator.wikimedia.org/T177458) [12:14:40] (03CR) 10Elukey: [C: 032] role::analytics_cluster::coordinator: remove prometheus jmx config from hive [puppet] - 10https://gerrit.wikimedia.org/r/395989 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [12:20:08] addshore: no worries, not planning to deploy anything soon :) [12:24:15] marostegui: thanks! [12:24:31] (03PS2) 10Addshore: Move Wikibase dispatchingLockManager to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395966 [12:24:35] (03CR) 10Addshore: [C: 032] Move Wikibase dispatchingLockManager to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395966 (owner: 10Addshore) [12:25:59] (03Merged) 10jenkins-bot: Move Wikibase dispatchingLockManager to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395966 (owner: 10Addshore) [12:26:48] (03CR) 10jenkins-bot: Move Wikibase dispatchingLockManager to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395966 (owner: 10Addshore) [12:31:11] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:395966|Move Wikibase dispatchingLockManager to InitialiseSettings]] PT 1/2 (duration: 00m 48s) [12:31:21] PROBLEM - HHVM jobrunner on mw2152 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time 
[12:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:32] (03PS3) 10Addshore: Create a LockManager for WikidataDispatch with short TTL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) [12:32:34] !log addshore@tin Synchronized wmf-config/Wikibase.php: [[gerrit:395966|Move Wikibase dispatchingLockManager to InitialiseSettings]] PT 2/2 (duration: 00m 48s) [12:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:22] cool! done! [12:34:10] PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:34:11] PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:34:21] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:34:30] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:34:37] (03CR) 10Addshore: "Hmmm, in theory when deploying this one we should stop the dispatchers first, then switch the lock location and then restart the dispatche" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) (owner: 10Addshore) [12:35:11] PROBLEM - HHVM rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:21] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:00] RECOVERY - Nginx local proxy to apache on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.027 second response time [12:36:10] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 74913 bytes in 0.113 second response time [12:36:10] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.044 second response time [12:38:21] RECOVERY - HHVM jobrunner on mw2152 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.074 second 
response time [12:43:18] Interesting, as of 21:30 / 21:33 yesterday POST requests to the mediawiki api seem to be taking longer [12:45:20] RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.230 second response time [12:46:01] addshore: hmm, for the "edit" requests it seems they got slightly worse yesterday around 11am UTC, maybe https://grafana.wikimedia.org/dashboard/db/api-requests?refresh=5m&panelId=19&fullscreen&orgId=1&from=now-2d&to=now [12:46:05] looking at the p75 [12:46:25] hashar: https://grafana.wikimedia.org/dashboard/db/api-summary?orgId=1&from=1512592360843&to=1512607232818&var-percentile=p50&var-dc=eqiad [12:46:41] oh nice [12:46:48] people on wikidata are saying the UI is slow, and it uses the api [12:46:58] feels slow to me, can't tell what happened at that time though [12:48:23] the fact it shows slightly in that edit module I guess means it is not just the wikidata api [12:48:26] parsoid got updated? [12:48:30] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:41] Yeah, but why should that change how the api works? ;) [12:49:49] I'm gonna do some profiling in a sec [12:50:10] RECOVERY - HHVM rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 74911 bytes in 0.094 second response time [12:50:21] RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 5.430 second response time [12:50:30] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 5.419 second response time [12:52:15] jouncebot: refresh [12:52:19] I refreshed my knowledge about deployments.
[12:52:24] jouncebot: next [12:52:26] In 1 hour(s) and 7 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T1400) [12:54:13] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw2152.codfw.wmnet [12:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:39] hashar: https://performance.wikimedia.org/xhgui/run/view?id=5a293968bb85442702d31b5f 11 second request i just made [12:54:56] 75% of which is in the autoloader? [12:55:15] * hashar blames Wikibase and RestBASE and volans [12:55:30] wait, that % is totally off, but, meh [12:55:58] autoloader is only 0.3 seconds, meh [12:57:42] addshore: maybe the flame graph is a little bit easier ? https://performance.wikimedia.org/xhgui/run/flamegraph?id=5a293968bb85442702d31b5f [12:57:54] supposedly that would highlight what is taking a while [12:58:13] hashar: indeed [12:58:23] so https://performance.wikimedia.org/xhgui/run/flamegraph?id=5a293968bb85442702d31b5f is a slow request, https://performance.wikimedia.org/xhgui/run/flamegraph?id=5a293a70bb85442702389e43 is fast [12:58:32] although, the api request itself is essentially the same [12:59:28] hmm 112 seconds?? really [12:59:40] no 11.2 seconds [12:59:44] pff [12:59:54] thats actually one of the worst I have seen [13:00:02] on https://performance.wikimedia.org/xhgui/run/flamegraph?id=5a293968bb85442702d31b5f [13:00:14] I read for main() -- 112,742,243.00 us [13:00:46] 112, lol what [13:00:54] well everything is slow on that one [13:00:58] so maybe the server was overbusy [13:01:07] <_joe_> so that's probably a request that somehow hit an appserver in a bad state? [13:01:15] <_joe_> do you have that info? 
lemme see [13:01:34] _joe_: I hit mwdebug1002 with those 2 profiled requests [13:01:36] there are plenty of Hooks::run() call not doing any further invocation but still taking ~ 7 seconds [13:01:43] so imho https://performance.wikimedia.org/xhgui/run/flamegraph?id=5a293968bb85442702d31b5f should be discarded [13:01:46] <_joe_> addshore: oh, uhm [13:01:49] unless that is something that happens often [13:02:03] something definitely feels off but i cant put my finger on it [13:02:07] <_joe_> lemme take a look [13:03:03] hashar: I actually think the flamegraphs might have an extra 0 for some reason, as the speedy flamegraph i posted above returned in 1.3 seconds, and the flamegraph apparently says 13seconds.. [13:03:14] ahhh [13:03:20] that is annoying :((( [13:03:22] might be worth filing a bug for this :P [13:03:25] *that [13:03:52] the call graph shows 11 secs yes [13:05:40] <_joe_> I'm looking at the apache logs [13:05:48] <_joe_> and either the timestamp is wrong [13:06:07] <_joe_> or I think it reports quite different numbers [13:06:43] <_joe_> no, ok, here it is [13:06:51] <_joe_> 11.8 seconds [13:07:17] so, the slow responses isnt in the network anywhere, its in mediawiki.... 
[13:07:25] <_joe_> well you made 5 requests to the api at that time, those servers are quite underpowered, that might be a reason [13:07:37] <_joe_> addshore: well it could be anything that mediawiki calls, too [13:07:41] hmm let me profile some on some other servers [13:08:07] _joe_: indeed, I thought the profile might indicate something clearly taking an age, but apparently tno [13:08:08] *not [13:09:02] <_joe_> addshore: something taking an age could be a db not responding quickly, or any issue within hhvm, or even connections to memcached, you name it [13:09:17] yup :( [13:09:25] <_joe_> if we don't see some form of consistent performance behaviour that we can investigate, I can't tell you more [13:10:11] Well, I get the same slow responses when not just targeting mwdebug1002, although can't profile those, but could look at which servers they hit [13:10:17] <_joe_> so, if I'm not needed anymore, I'll be back later! [13:10:39] <_joe_> if you get the same slow responses from multiple servers, I'd look behind mediawiki first [13:10:54] * _joe_ stares at the databases [13:11:42] <_joe_> but it could even be some bug in our code, of course; that would exclude the random malfunction and indicate something more systematic [13:12:07] <_joe_> addshore: care to open a ticket about this?
I gtg now and it's clearly not an emergency [13:12:13] _joe_: will do [13:12:24] <_joe_> but with repro steps I can try to poke at the problem :) [13:12:35] * _joe_ afk [13:35:24] PROBLEM - Apache HTTP on mw2217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:14] RECOVERY - Apache HTTP on mw2217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.108 second response time [13:39:59] !log reimaging mw2246 to stretch [13:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:02] (03PS4) 10Zfilipin: alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: 10EddieGP) [13:56:10] Oops, I forgot to switch branches before committing a change to core. https://gerrit.wikimedia.org/r/#/c/396007/1 [13:56:14] Should I do something to fix that? [13:56:45] Jhs: hmm, switched branches? it is on master! where did you want it to end up? [13:57:08] addshore, the docs say that you should do git checkout -B T before committing [13:57:20] hmm, so that just sets the topic [13:57:32] so the topic is currently set to T156589 [13:57:32] T156589: The native language name for [se] Northern Sami should be changed from "sámegiella" to "davvisámegiella" - https://phabricator.wikimedia.org/T156589 [13:57:38] you can change that in the UI if you want ;) [13:57:47] ah [13:58:40] done :) [13:59:55] <_joe_> Jhs: just fyi https://phabricator.wikimedia.org/T177498 [14:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T1400). [14:00:05] eddiegp: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] <_joe_> it's not gonna be years before we have a newer ICU :) [14:00:14] I can swat today [14:00:18] <_joe_> I hope it's good news :) [14:00:21] eddiegp: around for swat? [14:00:22] o/ [14:00:45] zeljkof: Yeah, I'm here :) [14:00:50] eddiegp: I'll ping you in a few minutes when the patch is at mwdebug1002 [14:01:26] PROBLEM - Apache HTTP on mw2200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:49] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: 10EddieGP) [14:02:18] RECOVERY - Apache HTTP on mw2200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.119 second response time [14:02:58] eddiegp: jenkins is a bit busy, it might be a few minutes until the patch is merged [14:03:25] As long as it doesn't fail this time, heh ;) [14:03:35] what happened the last time? [14:03:57] I saw that thcipriani.afk tried to merge it already [14:03:59] _joe_, yup, am aware, but have no idea about timeline :) [14:04:07] That was evening swat yesterday, CI broke completely, I waited for an hour and rescheduled then. [14:04:18] _joe_, and good news indeed :) [14:04:23] <_joe_> Jhs: I hope it will happen early next year [14:04:36] cool [14:04:39] eddiegp: no trouble with CI so far today... [14:05:18] and hashar is around, in case of emergency :) [14:05:24] <_joe_> I underline *hope* there - don't be upset if it happens a bit later :P [14:05:43] (03Merged) 10jenkins-bot: alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: 10EddieGP) [14:05:51] _joe_, hehe, no probs [14:06:22] eddiegp: the patch is at mwdebug1002, please test and let me know if I can deploy it [14:06:56] zeljkof: Tested, works. 
[14:07:11] (03CR) 10jenkins-bot: alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: 10EddieGP) [14:07:26] eddiegp: deploying... [14:08:17] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:395835|alswiki: Set wgRestrictDisplayTitle = false (T182154)]] (duration: 00m 49s) [14:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:27] T182154: Configuration change for als.wikipedia.org: Set wgRestrictDisplayTitle = false - https://phabricator.wikimedia.org/T182154 [14:08:38] eddiegp: deployed, please check and thanks for deploying with #releng! ;) [14:08:53] no more patches for SWAT? [14:09:28] !log EU SWAT finished [14:09:29] zeljkof: Works without mwdebug now too, thanks for deploying. :) [14:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:48] eddiegp: no problem, I am glad I could help [14:14:27] Out of curiosity, I have a patch scheduled for morning swat, but it's just cleaning up throttle.php. Do I really need to be around for that? [14:15:07] Zppix: You're around now, how about deploying it now? 
It's EU swat right now ;) [14:15:19] I can be around then too [14:15:53] Not to mention i cant do it rn as im on mobile xD [14:26:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395985 (owner: 10Marostegui) [14:28:16] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395985 (owner: 10Marostegui) [14:28:27] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395985 (owner: 10Marostegui) [14:29:18] (03PS1) 10Ottomata: Fix mylvmbackup for analytics-meta [puppet] - 10https://gerrit.wikimedia.org/r/396010 [14:29:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1074 (duration: 00m 47s) [14:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:30] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:30] PROBLEM - HHVM rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:50] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:32] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/9224/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/396010 (owner: 10Ottomata) [14:33:37] (03CR) 10Ottomata: [C: 031] "Sounds sane, elukey, whatcha think?" 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/395923 (https://phabricator.wikimedia.org/T182276) (owner: 10EBernhardson) [14:34:10] (03CR) 10Ottomata: [C: 032] Fix mylvmbackup for analytics-meta [puppet] - 10https://gerrit.wikimedia.org/r/396010 (owner: 10Ottomata) [14:34:47] Zppix: you have to be in this channel during SWAT [14:34:58] Ok [14:37:21] RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.024 second response time [14:37:21] RECOVERY - HHVM rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 74889 bytes in 0.097 second response time [14:55:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1099:331 with low weight - T178359 (duration: 00m 47s) [14:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:45] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [14:57:19] (03CR) 10Elukey: "Some comments plus there is the Jenkins -1 stuff to fix :)" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [15:04:31] (03PS12) 10Ottomata: Puppetization for superset [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [15:05:18] (03CR) 10jerkins-bot: [V: 04-1] Puppetization for superset [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [15:06:19] (03PS13) 10Ottomata: Puppetization for superset [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [15:09:49] (03PS14) 10Ottomata: Puppetization for superset [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [15:11:50] (03CR) 10Elukey: [C: 031] "LGTM https://puppet-compiler.wmflabs.org/compiler02/9227/thorium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/392978 
(https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [15:12:10] yeehaw [15:12:30] (03CR) 10Ottomata: [C: 032] Puppetization for superset (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [15:16:54] (03PS1) 10Muehlenhoff: Refresh valgrind.patch [debs/openssl] - 10https://gerrit.wikimedia.org/r/396020 [15:17:45] (03PS1) 10Giuseppe Lavagetto: [WiP] Create an envoy docker image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/396021 [15:18:56] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396022 (https://phabricator.wikimedia.org/T178359) [15:18:57] !log otto@tin Started deploy [analytics/superset/deploy@f0f5adf]: initial deployment [15:18:59] !log otto@tin Finished deploy [analytics/superset/deploy@f0f5adf]: initial deployment (duration: 00m 02s) [15:19:05] (03CR) 10Muehlenhoff: [C: 032] Update to 1.0.2n [debs/openssl] - 10https://gerrit.wikimedia.org/r/396013 (owner: 10Muehlenhoff) [15:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:24] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 47 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Package[analytics/superset/deploy] [15:20:36] (03CR) 10Muehlenhoff: [C: 032] Refresh valgrind.patch [debs/openssl] - 10https://gerrit.wikimedia.org/r/396020 (owner: 10Muehlenhoff) [15:21:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396022 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:23:22] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396022 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:23:33] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396022 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:24:24] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:24:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1099:3311 weight - T178359 (duration: 00m 48s) [15:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:50] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [15:26:43] (03PS1) 10Ottomata: Fix superset hiera admin_password var name [puppet] - 10https://gerrit.wikimedia.org/r/396024 [15:27:03] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:27:24] (03CR) 10Ottomata: [C: 032] Fix superset hiera admin_password var name [puppet] - 10https://gerrit.wikimedia.org/r/396024 (owner: 10Ottomata) [15:27:33] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:27:53] PROBLEM - HHVM rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:31] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[analytics/superset/deploy] [15:29:41] trying to run hhvm-dump-debug on mw1314 [15:30:32] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1314.eqiad.wmnet [15:30:40] PROBLEM - Check systemd state on thorium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:40] RECOVERY - Check systemd state on thorium is OK: OK - running: The system is fully operational [15:39:47] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396026 (https://phabricator.wikimedia.org/T178359) [15:42:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396026 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:42:12] !log hhvm-dump-debug for mw1314 saved to /tmp/hhvm.17991.bt. 
[15:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:41] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396026 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:44:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1099:3311 weight - T178359 (duration: 00m 48s) [15:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:48] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [15:46:57] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396026 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:50:10] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.049 second response time [15:50:31] RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.820 second response time [15:50:50] RECOVERY - HHVM rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 74937 bytes in 0.178 second response time [15:51:55] 10Operations, 10Continuous-Integration-Config, 10Incident-20160126-WikimediaDomainRedirection, 10Regression, 10Wikimedia-Incident: operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801#3820379 (10thiemowmde) 05Open>03Invalid [15:51:57] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#3820380 (10thiemowmde) [15:54:53] (03PS1) 10Marostegui: db-eqiad.php: Repool db1093 as main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396030 (https://phabricator.wikimedia.org/T178359) [15:57:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1093 as main 
traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396030 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:58:16] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#3820401 (10Legoktm) [15:58:18] 10Operations, 10Continuous-Integration-Config, 10Incident-20160126-WikimediaDomainRedirection, 10Regression, 10Wikimedia-Incident: operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801#3820399 (10Legoktm) 05Invalid>03Open This hasn't been fixed yet. [16:00:39] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1093 as main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396030 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [16:00:49] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1093 as main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396030 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [16:01:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 back as main traffic in s6 - T178359 (duration: 00m 48s) [16:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:01] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [16:03:02] !log uploaded openssl 1.0.2n for jessie-wikimedia to apt.wikimedia.org [16:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:33] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331#3820428 (10Halfak) [16:05:49] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331#3820428 (10awight) One thing that @akosiaris pointed out, we'll want to replace this puppet formula: > $processes = 
$::processorcount * $workers_per_core and specify the num... [16:07:14] (03PS1) 10Herron: puppetmaster: change puppetmaster1002 puppet major version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/396036 (https://phabricator.wikimedia.org/T177254) [16:07:25] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396037 [16:10:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396037 (owner: 10Marostegui) [16:12:14] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396037 (owner: 10Marostegui) [16:12:23] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3820532 (10Halfak) In this case, it's advanced smoke testing for the cluster. I'm hesitant to deploy in production until we've thoro... 
[16:12:27] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396037 (owner: 10Marostegui) [16:13:14] !log upgrading puppetmaster1002 to puppet 4 [16:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:52] (03PS2) 10Chad: Gerrit 2.14.6 [software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) [16:14:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1099:3311 weight - T178359 (duration: 00m 48s) [16:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:09] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [16:16:00] (03CR) 10Herron: [C: 032] puppetmaster: change puppetmaster1002 puppet major version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/396036 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:28:34] (03PS1) 10Elukey: Fix metric rendering [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/396051 [16:29:29] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:34:22] (03PS2) 10Elukey: Fix metric rendering [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/396051 [16:36:13] (03PS3) 10Elukey: Fix metric rendering [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/396051 [16:41:42] (03PS1) 10Awight: Refactor ORES uWSGI workers to use an absolute count [puppet] - 10https://gerrit.wikimedia.org/r/396055 (https://phabricator.wikimedia.org/T182249) [16:43:35] Hey operations, Icinga died... 
probably will need a restart just letting yall know [16:47:02] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3820612 (10RStallman-legalteam) @Cmjohnson, Vinitha has signed the NDA and it's on file in our contracts software. Thank you! [16:47:41] (03CR) 10Herron: [C: 032] puppetmaster: add proxypassmatch rules for puppet 4 url variants [puppet] - 10https://gerrit.wikimedia.org/r/395832 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:47:47] (03PS2) 10Herron: puppetmaster: add proxypassmatch rules for puppet 4 url variants [puppet] - 10https://gerrit.wikimedia.org/r/395832 (https://phabricator.wikimedia.org/T177254) [16:51:18] Zppix: Icinga works for me? [16:51:38] moritzm: i meant ircecho [16:53:07] ah, ok. It might be intentionally disabled since Keith is working on the Puppet update [16:53:20] herron: ^ ? [16:53:46] no not intentionally disabled by me, looks like the process is running but can try bouncing it [16:54:19] ok it's been restarted [16:54:52] Ty [16:55:28] np [16:55:33] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3820633 (10Lydia_Pintscher) Wikibase is supposed to be its own product. We are going to push for more use of it outside Wikimedia in 2018. It... [16:57:15] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3820638 (10Addshore) >>! In T99531#3820633, @Lydia_Pintscher wrote: > Wikibase is supposed to be its own product. We are going to push for mo... [17:00:04] godog, moritzm, and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T1700). 
[17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:01:10] (03PS1) 10Herron: puppetmaster: change puppetmaster1001 puppet major version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/396058 (https://phabricator.wikimedia.org/T177254) [17:05:43] (03CR) 10Halfak: [C: 04-1] "This will affect production." [puppet] - 10https://gerrit.wikimedia.org/r/396055 (https://phabricator.wikimedia.org/T182249) (owner: 10Awight) [17:06:25] (03PS4) 10Elukey: Allow metric to have the same name and different labels [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/396051 [17:08:59] !log temporarily disabling all puppet agents during puppetmaster1001 (puppet ca) upgrade to puppet 4 [17:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:11] (03PS5) 10Elukey: List of fixes: [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/396051 [17:13:36] !log upgrading puppetmaster1001 to puppet 4 [17:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:48] (03CR) 10Herron: [C: 032] puppetmaster: change puppetmaster1001 puppet major version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/396058 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [17:13:52] \o/ [17:16:49] Is SWAT still happening? [17:16:59] jouncebot: now [17:16:59] For the next 0 hour(s) and 43 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T1700) [17:17:04] Yep [17:17:32] (03PS21) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [17:17:34] Zppix: That's not that helpful. I know what time it is. What I was wondering is whether people are still here. 
[17:17:37] (03PS9) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) [17:18:09] !log milimetric@tin Started deploy [analytics/aqs/deploy@4ec13b4]: (no justification provided) [17:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:45] (03PS1) 10Halfak: Bumps stresstest web workers_per_core from 2 to 6. [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) [17:19:27] This backport needs deploying, but it was made so late I didn't have time to schedule it: https://gerrit.wikimedia.org/r/#/c/396050/ [17:19:50] I can put it on there now, so I hope someone is around to deploy it. [17:21:15] Deskana: ah sorry, I didn't understand whether you were asking what window we were in or not [17:21:26] Deskana: you can probably ask no_justification to deploy it before he does the MW train [17:21:31] (03PS6) 10Elukey: List of fixes: [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/396051 [17:21:42] Deskana: {{doing}} [17:23:16] Anyone doing a puppetswat? I didn't get it on the list but we're still in the window and it's beta-only :) [17:25:09] !log puppetmaster1001 upgraded to puppet 4. re-enabling puppet agents across the fleet [17:25:12] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1314.eqiad.wmnet [17:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:38] !log milimetric@tin Finished deploy [analytics/aqs/deploy@4ec13b4]: (no justification provided) (duration: 07m 28s) [17:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:59] could https://gerrit.wikimedia.org/r/#/c/393052/ also be done? 
(I didn't expect to be around for the deploy) [17:26:44] Zackary: That's not a deployed extension [17:26:52] So it doesn't need anyone here [17:27:23] oh [17:27:46] https://gerrit.wikimedia.org/r/#/c/392999/ then? [17:28:11] Gave you a +2 on 393052 though :) [17:28:40] thanks [17:30:03] That $wgLocalVirtualHosts one is a little weird. I'm not entirely sure what that setting does :p [17:30:32] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [17:30:54] Makes MW decide if it's "local" [17:31:00] if ( in_array( $domain, $wgLocalVirtualHosts ) ) { [17:31:00] return true; [17:31:00] } [17:31:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5 [17:31:31] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5 [17:32:25] Reedy: What...does that matter? [17:32:55] no_justification: Basically... It seems to be whether MW will try to use a proxy to access that wiki [17:33:00] !log demon@tin Synchronized php-1.31.0-wmf.11/extensions/VisualEditor/lib/ve: Ief480487, Deskana made me do it (duration: 00m 49s) [17:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:25] Do we actually do any cross wiki http requests from php? 
[17:33:31] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=codfwvar-cache_type=Allvar-status_type=5 [17:34:11] I'm not seeing anything in the MW logs immediately to explain the 5xx spike ^ [17:34:22] Reedy: Nothing would surprise me :p [17:34:26] no_justification: That did the trick. Thanks! [17:34:29] trudat [17:34:46] single spike, seems gone https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=now-3h&to=now [17:35:04] no clear issue from https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X [17:35:23] * no_justification nods [17:36:32] PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:43] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:43] PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:50] (03PS7) 10Elukey: List of fixes: [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/396051 [17:38:32] mw1281 is not feeling well https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1281 [17:39:31] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=codfwvar-cache_type=Allvar-status_type=5 [17:39:31] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5 [17:39:32] need to step away from keyboard now but if it doesn't recover by itself please depool/investigate it --^ [17:39:32] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [17:40:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5 [17:46:31] RECOVERY - Nginx local proxy to apache on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 7.055 second response time [17:46:41] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.032 second response time [17:46:42] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 74960 bytes in 0.104 second response time [17:54:32] (03PS2) 10Halfak: Refactor web workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) [17:55:54] (03PS3) 10Halfak: Refactor web workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) [17:57:04] (03PS4) 10Halfak: Refactor web workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) [17:57:16] (03PS1) 10Herron: puppet: re-pool puppetmaster1001 as puppet.ulsfo.wmnet [dns] - 10https://gerrit.wikimedia.org/r/396070 (https://phabricator.wikimedia.org/T177254) [17:58:39] (03CR) 10DCausse: elasticsearch: use the canonical definition of logstash host (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/395970 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [17:59:28] (03CR) 10Herron: [C: 032] puppet: re-pool puppetmaster1001 as puppet.ulsfo.wmnet [dns] - 10https://gerrit.wikimedia.org/r/396070 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [17:59:54] (03PS1) 10Chad: group2 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396071 [17:59:56] (03CR) 10Chad: [C: 04-2] group2 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396071 (owner: 10Chad) [18:00:05] cscott, arlolra, subbu, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:19] !log re-pooling eqiad puppet 4 masters as puppet.ulsfo.wnet [18:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:35] !log mholloway-shell@tin Started deploy [mobileapps/deploy@2fa32ed]: Update mobileapps to 71f581c [18:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:41] (03PS1) 10Gehel: service: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/396072 (https://phabricator.wikimedia.org/T182304) [18:04:19] (03CR) 10jerkins-bot: [V: 04-1] service: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/396072 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [18:06:11] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@2fa32ed]: Update mobileapps to 71f581c (duration: 05m 36s) [18:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:27] (03PS4) 10Gehel: elasticsearch: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/395970 (https://phabricator.wikimedia.org/T182304) 
[18:17:00] Can someone double-check my edit here: https://wikitech.wikimedia.org/w/index.php?title=Puppet_coding&diff=1777679&oldid=1775400 [18:18:17] (03CR) 10Gehel: "puppet compiler still happy: https://puppet-compiler.wmflabs.org/compiler02/9228/" [puppet] - 10https://gerrit.wikimedia.org/r/395970 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [18:18:56] (03PS5) 10Awight: Refactor web workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [18:20:48] (03PS2) 10Dzahn: ganglia: delete views for kafkatee, hadoop, varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/395890 (https://phabricator.wikimedia.org/T177225) [18:21:12] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 30 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:21:31] (03PS1) 10Muehlenhoff: Add .gitreview file [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/396075 [18:22:13] (03CR) 10Dzahn: [C: 032] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/395890 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:23:49] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add .gitreview file [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/396075 (owner: 10Muehlenhoff) [18:24:05] (03CR) 10Halfak: [C: 031] "Confirmed that this hard-coding of EQIAD scb* nodes matches the current uwsgi counts on those nodes and all CODFW scb* nodes run 32 uwsgi " [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [18:24:07] (03PS1) 10Dzahn: Revert "oxygen: revert removing ganglia" [puppet] - 10https://gerrit.wikimedia.org/r/396076 [18:24:43] (03PS2) 10Dzahn: Revert "oxygen: revert removing ganglia" [puppet] - 10https://gerrit.wikimedia.org/r/396076 [18:24:55] (03CR) 10Awight: [C: 031] Refactor web workers for ORES [puppet] - 
10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [18:26:07] (03CR) 10Dzahn: [C: 032] Revert "oxygen: revert removing ganglia" [puppet] - 10https://gerrit.wikimedia.org/r/396076 (owner: 10Dzahn) [18:26:12] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 12 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:26:38] (03PS5) 10Gehel: elasticsearch: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/395970 (https://phabricator.wikimedia.org/T182304) [18:28:15] (03CR) 10Gehel: [C: 032] elasticsearch: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/395970 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [18:28:58] grrr, still a puppet issue on oxygen.. but i am looking [18:29:30] ACKNOWLEDGEMENT - puppet last run on oxygen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn WIP - dzahn - rm ganglia [18:32:44] 10Operations, 10Analytics, 10Analytics-Cluster: stat1004 - /mnt/hdfs is not accessible - https://phabricator.wikimedia.org/T182342#3820871 (10Dzahn) [18:33:11] ACKNOWLEDGEMENT - Disk space on stat1004 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Transport endpoint is not connected daniel_zahn https://phabricator.wikimedia.org/T182342 [18:33:41] PROBLEM - mediawiki-installation DSH group on mw2246 is CRITICAL: Host mw2246 is not in mediawiki-installation dsh group [18:35:18] 10Operations, 10Continuous-Integration-Config, 10Incident-20160126-WikimediaDomainRedirection, 10Regression, 10Wikimedia-Incident: operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801#3820885 (10Dzahn) This is correct, i believe it's still a valid ToD... 
[18:37:51] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/392399 (owner: 10Hashar) [18:37:56] (03PS1) 10Herron: re-pool puppetmaster1001 as puppet.(eqiad.wmnet|wikimedia.org) [dns] - 10https://gerrit.wikimedia.org/r/396078 (https://phabricator.wikimedia.org/T177254) [18:42:14] why could i not find any of that code? ..SUBmodules.. grrroar [18:44:52] (03PS1) 10Herron: puppet: remove (cleanup) hiera regex used for puppet 4 validation [puppet] - 10https://gerrit.wikimedia.org/r/396079 (https://phabricator.wikimedia.org/T177254) [18:45:05] how do i clone the kafkatee module? [18:46:27] got it :) [18:50:10] mutante: I have 2 mediawiki config patches that I need to coordinate with some puppet patches (disable cron, wait for scripts to stop, update config, enable cron) x2 [18:50:34] would you have time in the next SWAT window to help me out? or am I best to do them the other way around and do it during a puppet swat? [18:54:11] addshore: most days i would say yes, but today i'll have to pick the puppet swat way. in the middle of something [18:54:29] ack, thats fine! thanks :) [19:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T1900). [19:00:05] Zppix, Lucas_WMDE, and stephanebisson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:34] I'm here [19:00:35] Dont tempt me jouncebot [19:00:47] I’m here [19:03:44] Well i guess we wait for a swatter [19:05:09] I can SWAT [19:05:26] (03PS3) 10Thcipriani: Rm all past throttle overrides in throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395918 (owner: 10Zppix) [19:06:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395918 (owner: 10Zppix) [19:07:12] (03PS1) 10Dzahn: logging/kafkatee: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/396086 (https://phabricator.wikimedia.org/T177225) [19:08:02] (03Merged) 10jenkins-bot: Rm all past throttle overrides in throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395918 (owner: 10Zppix) [19:08:13] (03CR) 10jenkins-bot: Rm all past throttle overrides in throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395918 (owner: 10Zppix) [19:09:05] Zppix: I'll go ahead and deploy and skip mwdebug for ^ unless there's anything you want to check [19:09:39] thcipriani: i mean unless you have any concerns thats fine [19:09:51] (03CR) 10Dzahn: [C: 032] logging/kafkatee: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/396086 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:10:07] Zppix: no concerns, going live :) [19:10:22] Thanks! :) [19:10:45] * Zppix hopes i dont break everything [19:11:46] If there is room, I'd like to add something to SWAT [19:11:49] not testable [19:12:16] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:395918|Rm all past throttle overrides in throttle.php]] (duration: 00m 48s) [19:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:33] ^ Zppix all live, thank you for the cleanup! 
[19:12:40] thcipriani: np [19:12:57] Amir1: there's probably room [19:13:16] !log ppchelko@tin Started deploy [changeprop/deploy@3c4f51d]: Long awaited deploy: generic optimizations, gc metric, delay reporting [19:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:24] (03PS3) 10Thcipriani: Remove obsolete WikibaseQualityConstraints settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392449 (owner: 10Lucas Werkmeister (WMDE)) [19:13:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392449 (owner: 10Lucas Werkmeister (WMDE)) [19:13:58] thcipriani: ill stick around just incase though o/ [19:14:13] ok, thanks :) [19:14:20] (03PS1) 10Dzahn: kafkatee: remove Ganglia monitoring class and script [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/396088 (https://phabricator.wikimedia.org/T177225) [19:14:31] !log ppchelko@tin Finished deploy [changeprop/deploy@3c4f51d]: Long awaited deploy: generic optimizations, gc metric, delay reporting (duration: 01m 15s) [19:14:32] PROBLEM - Nginx local proxy to apache on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:51] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:52] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:52] PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:21] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:32] PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:36] (03PS1) 1020after4: Bump scap version to 3.7.4-1 [puppet] - 10https://gerrit.wikimedia.org/r/396089 [19:16:06] (03PS1) 10Ladsgroup: Start description usage tracking for commonswiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/396090 (https://phabricator.wikimedia.org/T106287) [19:16:12] (03Merged) 10jenkins-bot: Remove obsolete WikibaseQualityConstraints settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392449 (owner: 10Lucas Werkmeister (WMDE)) [19:16:22] thcipriani: https://gerrit.wikimedia.org/r/396090 [19:16:29] I update the deployment page [19:16:37] thanks for the merge thcipriani :) [19:16:38] ok, thanks [19:16:51] what's happening [19:16:58] (03CR) 10jenkins-bot: Remove obsolete WikibaseQualityConstraints settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392449 (owner: 10Lucas Werkmeister (WMDE)) [19:17:02] Lucas_WMDE: sure, it's live on mwdebug1002, if you want to check it [19:17:14] why are API Apaches so busy suddenly? [19:18:29] mutante: i tend to notice that socket timeout isnt always bad sometimes the icinga check just times out [19:18:30] mutante: unknown from SWAT side, the only thing that has been deployed is removing an outdated throttle rule [19:18:41] thcipriani: not sure if there’s anything specific I can test… constraint checks still work, that’s good enough for me [19:19:20] Lucas_WMDE: ok thanks, I'm going to give mutante a little time to investigate and I'll continue SWAT after that. [19:19:29] mutante: ^ assumes you want some time to investigate [19:19:32] :) [19:22:25] thcipriani: not really, i can't look at it now, doesn't seem like an outage [19:22:40] mutante: ok, thanks [19:22:44] Lucas_WMDE: going live [19:24:45] !log thcipriani@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:392449|Remove obsolete WikibaseQualityConstraints settings]] (duration: 00m 48s) [19:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:54] ^ Lucas_WMDE live now [19:25:03] thanks, checking [19:25:21] everything still seems to work! 
thanks again [19:25:49] yw :) [19:26:13] stephanebisson: you change is live on mwdebug1002, check please [19:26:20] on it [19:26:54] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396090 (https://phabricator.wikimedia.org/T106287) (owner: 10Ladsgroup) [19:27:24] thcipriani: works as expected [19:27:30] ok, going live [19:28:25] (03Merged) 10jenkins-bot: Start description usage tracking for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396090 (https://phabricator.wikimedia.org/T106287) (owner: 10Ladsgroup) [19:28:36] (03CR) 10jenkins-bot: Start description usage tracking for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396090 (https://phabricator.wikimedia.org/T106287) (owner: 10Ladsgroup) [19:29:43] !log thcipriani@tin Synchronized php-1.31.0-wmf.11/includes/specialpage/ChangesListSpecialPage.php: SWAT: [[gerrit:396054|WLFilters: Correctly check if RCFilters should be enabled on WL]] T182318 (duration: 00m 48s) [19:29:50] ^ stephanebisson live everywhere [19:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:53] T182318: New filters for edit review on Watchlist can't be opt-out on wikis - https://phabricator.wikimedia.org/T182318 [19:30:13] thcipriani: lgtm, thanks! [19:30:25] awesome. yw :) [19:31:14] Amir1: yours is untestable on mwdebug you said, correct? [19:33:49] thcipriani: yes [19:33:55] ok, going live [19:35:00] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:396090|Start description usage tracking for commonswiki]] T106287 (duration: 00m 48s) [19:35:06] ^ Amir1 live now [19:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:09] T106287: [Tracking] Track descriptions usages separately (Create a new description usage aspect "D") - https://phabricator.wikimedia.org/T106287 [19:35:16] thcipriani: Thanks! 
[19:35:19] yw :) [19:36:14] (03PS1) 10ArielGlenn: Revert "Wikidata weekly json and rdf dumps disabled temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/396097 [19:36:38] (03CR) 10Hoo man: [C: 031] Revert "Wikidata weekly json and rdf dumps disabled temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/396097 (owner: 10ArielGlenn) [19:37:14] (03PS2) 10ArielGlenn: Revert "Wikidata weekly json and rdf dumps disabled temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/396097 [19:38:16] (03CR) 10ArielGlenn: [C: 032] Revert "Wikidata weekly json and rdf dumps disabled temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/396097 (owner: 10ArielGlenn) [19:40:03] !log joal@tin Started deploy [analytics/refinery@bd9c6cc]: Regualr analytics deploy - Long time no see, deployment :) [19:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:42] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3821059 (10Dzahn) Is this unstalled now? The reason was while T168246 is ongoing but that ticket is resolved. Is it really resolved though? [19:41:41] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3414945 (10Dzahn) Is the stress test over? Then T165170 is probably unstalled now. Is it not over yet? Then maybe this ticket shou... 
[19:48:52] RECOVERY - Nginx local proxy to apache on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.253 second response time [19:48:52] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 74909 bytes in 0.087 second response time [19:48:55] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3821071 (10Pchelolo) After a day of running the jobs for wiktionaries I don't see any issues at all, but on the contrary I don't really see any deduplication -... [19:52:01] PROBLEM - Nginx local proxy to apache on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:02] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:18] !log joal@tin Finished deploy [analytics/refinery@bd9c6cc]: Regualr analytics deploy - Long time no see, deployment :) (duration: 12m 15s) [19:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:11] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[analytics/refinery] [20:00:04] no_justification: Time to snap out of that daydream and deploy MediaWiki train. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. 
[20:00:11] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:00:48] (03PS1) 10Dzahn: site: add ores200[19] as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/396101 (https://phabricator.wikimedia.org/T177225) [20:01:36] go away jouncebot [20:01:37] i don't wanna deploy rn [20:01:37] * no_justification makes a sandwich instead [20:02:32] Poor jouncebot its just doing its job [20:02:45] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3821108 (10awight) @Dzahn sorry--we decided to test some more, to overcome a suspiciously low performance ceiling. I'll make the fol... [20:04:19] Zppix: I had it on /ignore for weeks :p [20:05:47] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3821132 (10awight) 05Resolved>03Open Reopening until we finish with {T182249}. [20:06:20] no_justification: so it was just trying to log in what caused the issue wrt. globalblocking? 
[20:06:47] I mean it's not really an issue, it's a pretty uncommon error, relatively [20:06:52] Mostly just filing the task and following up [20:06:55] Definitely wmf.11 related [20:09:33] (03PS2) 10Dzahn: site: add ores200[19] as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/396101 (https://phabricator.wikimedia.org/T177225) [20:12:47] !log joal@tin Started deploy [analytics/refinery@3e52903]: Regular analytics deploy - Long time no see, deployment :) - post-patch [20:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:26] (03PS3) 10Dzahn: site: add ores200[19] as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/396101 (https://phabricator.wikimedia.org/T177225) [20:13:54] (03CR) 10Dzahn: [C: 032] site: add ores200[19] as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/396101 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:14:08] !log joal@tin Finished deploy [analytics/refinery@3e52903]: Regular analytics deploy - Long time no see, deployment :) - post-patch (duration: 01m 20s) [20:14:15] (03PS5) 10Dzahn: site: add ores200[19] as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/396101 (https://phabricator.wikimedia.org/T177225) [20:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:19] Um wasn't service deploy like an hour ago? [20:14:31] Re: refinery? [20:15:17] This is train window... 
[20:17:03] !log joal@tin Started deploy [analytics/refinery@53bd630]: Regular analytics deploy - Long time no see, deployment :) - post-patch-2 (hopefully last for tonight) [20:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:14] Sorry for the spam ops-team [20:18:07] joal: ^ to no_justification [20:18:07] !log re-pooling eqiad puppet 4 masters via dns puppet.eqiad.wmnet puppet.wikimedia.org [20:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:43] (03CR) 10Herron: [C: 032] re-pool puppetmaster1001 as puppet.(eqiad.wmnet|wikimedia.org) [dns] - 10https://gerrit.wikimedia.org/r/396078 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [20:19:12] mutante: If you’re looking at ORES stuff… https://gerrit.wikimedia.org/r/#/c/396064/ [20:19:14] joal: I don't care about the spam, that's scap logging like it's supposed to. I'm curious why the log messages are appearing at all--it's my deploy window right now [20:19:37] awight: sorry, i'm not. i just need all nodes to have roles so i can kill ganglia [20:19:42] RECOVERY - Nginx local proxy to apache on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.032 second response time [20:19:43] and ores2* didnt have one [20:19:54] mutante: :D cool, no rush on that anyway [20:19:56] no_justification: I have no clue at all [20:19:57] which means it doesnt get a few things, incl. firewall [20:20:40] (03PS1) 10Ottomata: Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) [20:20:43] joal: it deployed itself? 
[20:21:34] (03CR) 10jerkins-bot: [V: 04-1] Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [20:21:36] (03PS2) 10Ottomata: Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) [20:21:55] greg-g: almost, canary done, 99% chances the rest follows :) [20:22:10] (03CR) 10jerkins-bot: [V: 04-1] Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [20:22:23] (03PS3) 10Ottomata: Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) [20:22:24] !log joal@tin Finished deploy [analytics/refinery@53bd630]: Regular analytics deploy - Long time no see, deployment :) - post-patch-2 (hopefully last for tonight) (duration: 05m 21s) [20:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:41] joal: I think you're missing the point. Please don't deploy during others deploy window. It's serious. 
[20:22:52] PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:34] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3821194 (10herron) [20:24:11] greg-g: I did miss the point - I apologize for that [20:24:11] no_justification: My apologizes to you as well [20:25:15] greg-g: had not been informed about deployment schedules in analytics - It's the first time over many deploys that I got anyone mentioning this [20:25:38] (03PS4) 10Ottomata: Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) [20:26:56] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Icinga: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#3821198 (10Dzahn) Is there a ticket to get eventlog2001 back into production? It is in site.pp but doesn't have any roles. Adding it with ro... [20:27:31] no_justification: greg-g, we've never had any conflict with deploy windows before [20:27:40] just because we use scap doesn't mean we opt into deploy windows...does it? [20:28:02] Well, all things that are like service deploys should either have a window [20:28:02] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Icinga: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#1840331 (10Ottomata) > Is there a ticket to get eventlog2001 back into production? It never was in production. [20:28:05] Or at least ask [20:28:16] its not a service [20:28:21] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Icinga: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#3821203 (10Dzahn) So it should be decom'ed? 
[20:28:43] Point being, it's a deployment of some non-puppet / non-dns software. Ideally someone at least says "Hey, cool if I deploy XYZ" right now? [20:28:47] it is also not a 'production' thing [20:29:04] (03PS1) 10Dzahn: site: add eventlog2001 as spare::system [puppet] - 10https://gerrit.wikimedia.org/r/396104 (https://phabricator.wikimedia.org/T119930) [20:29:09] * no_justification rolls eyes [20:29:11] Still not the point [20:29:16] It's a /deployment/ [20:29:25] no_justification: who would we ask? there is no one but analytics that would care about this [20:29:36] i'd even prefer if the scap didn't log certain analytics deployments in this channel [20:29:47] Oh, I dunno, just asking out loud to anyone in the channel. [20:29:55] "Hey, I'm gonna deploy XYZ that cool?" [20:30:23] yeah, we were all talking about it over in the analytics channel [20:30:48] RECOVERY - Nginx local proxy to apache on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.030 second response time [20:30:51] Which I'm in but I don't read unless I'm pinged or asking a question [20:30:54] But whatever, forget it [20:31:03] why would you want to know about rsyncing of some .jar files? [20:31:12] !log restart hhvm on mw1281 - hhvm stuck (hhvm-dump-debug timing out) [20:31:17] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time [20:31:17] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 74911 bytes in 0.151 second response time [20:31:20] Because I care about everything that's deployed. That's kind of releng's job. 
[20:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:41] (03CR) 10Chad: [C: 032] group2 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396071 (owner: 10Chad) [20:33:09] (03Merged) 10jenkins-bot: group2 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396071 (owner: 10Chad) [20:33:11] (03PS2) 10Dzahn: site: add eventlog2001 as spare::system [puppet] - 10https://gerrit.wikimedia.org/r/396104 (https://phabricator.wikimedia.org/T119930) [20:33:23] (03CR) 10jenkins-bot: group2 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396071 (owner: 10Chad) [20:33:54] no_justification: I'm happy to shout in ops-chan abou me deploying analytics stuff - I was just anware it would interest anyone [20:34:53] joal: Thanks (and no need to apologize, you didn't know). I just think it's best practice when deploying in non-emergency situations to check first with $someone :) [20:35:05] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch Cloud VPS puppet default to future parser - https://phabricator.wikimedia.org/T179451#3821225 (10Paladox) I think this was done with https://gerrit.wikimedia.org/r/#/c/392172/ ? [20:35:55] !log restart hhvm on mw1235 - hhvm-dump-debug hanging out, not stacktrace available [20:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:04] understood no_justification - We do check internally in analytics, I'll add to the docs to also ping ops chan [20:36:09] (03CR) 10Dzahn: [C: 032] site: add eventlog2001 as spare::system [puppet] - 10https://gerrit.wikimedia.org/r/396104 (https://phabricator.wikimedia.org/T119930) (owner: 10Dzahn) [20:36:14] Not even because it'll impact me, or touches my work or anything. 
Mostly just so we can keep a running tab of what's changing when :) [20:36:17] (03PS3) 10Dzahn: site: add eventlog2001 as spare::system [puppet] - 10https://gerrit.wikimedia.org/r/396104 (https://phabricator.wikimedia.org/T119930) [20:36:33] Mostly in case something goes $wrong, there's less variables to consider [20:36:44] "Was it X or Y that broke?" kind of things [20:36:45] :) [20:36:47] works for me no_justification [20:36:50] <3 [20:36:53] :) [20:37:27] RECOVERY - Nginx local proxy to apache on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.130 second response time [20:37:28] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 74911 bytes in 0.821 second response time [20:37:36] anyhow, I'm done deploying for tonight normally no_justification - no more message from me (I think) [20:38:20] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.043 second response time [20:39:32] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.11 [20:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:28] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3821231 (10bd808) [20:40:28] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch Cloud VPS puppet default to future parser - https://phabricator.wikimedia.org/T179451#3821229 (10bd808) 05Open>03Resolved a:03Andrew [20:40:30] joal: General rule of thumb too, 20:00-22:00 UTC Tues/Wed/Thurs are MW deploys to the sites, not the best time for other deploys because releng's eyes are there, and it's potentially a huge impact :) [20:41:34] no_justification: Well heard - It's true that it's not my regular deploy time [20:41:36] (03PS5) 10Ottomata: Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) [20:41:38] (03CR) 
10Reedy: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395687 (https://phabricator.wikimedia.org/T181107) (owner: 10Gergő Tisza) [20:41:54] (03CR) 10Reedy: [C: 031] Enable ReadingLists on all SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395688 (https://phabricator.wikimedia.org/T181107) (owner: 10Gergő Tisza) [20:42:17] (03CR) 10jerkins-bot: [V: 04-1] Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [20:43:40] (03PS1) 10Dzahn: bastionhost, mw_rc_irc,backup::offsite,pybaltest: rm ganglia [puppet] - 10https://gerrit.wikimedia.org/r/396106 (https://phabricator.wikimedia.org/T177225) [20:45:11] (03CR) 10Ottomata: [V: 032 C: 032] "Wow that's a crazy style warning:" [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [20:45:42] (03PS6) 10Ottomata: Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) [20:45:42] (03CR) 10Ottomata: [V: 032 C: 032] Configure LDAP proxy and authentication for Superset [puppet] - 10https://gerrit.wikimedia.org/r/396103 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [20:47:47] (03PS2) 10Dzahn: bastionhost, mw_rc_irc,backup::offsite,pybaltest: rm ganglia [puppet] - 10https://gerrit.wikimedia.org/r/396106 (https://phabricator.wikimedia.org/T177225) [20:48:59] awight: Spotted in production: [{exception_id}] {exception_url} MWException from line 140 of /srv/mediawiki/php-1.31.0-wmf.11/includes/Preferences.php: Global default 'soft' is invalid for field rcOresDamagingPref [20:49:21] no_justification: Nooo I was just reading your exceptional email [20:50:55] (03CR) 10Dzahn: [C: 032] bastionhost, mw_rc_irc,backup::offsite,pybaltest: rm ganglia [puppet] - 10https://gerrit.wikimedia.org/r/396106 
(https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:57:40] (03PS1) 10Ottomata: Add superset.wikimedia.org DYNA/CNAME [dns] - 10https://gerrit.wikimedia.org/r/396124 (https://phabricator.wikimedia.org/T166689) [20:58:11] (03CR) 10Ottomata: [C: 032] Add superset.wikimedia.org DYNA/CNAME [dns] - 10https://gerrit.wikimedia.org/r/396124 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [21:01:14] (03PS1) 10Herron: Revert "puppet: temporarily allow puppetcompiler1001 to fetch all catalogs" [puppet] - 10https://gerrit.wikimedia.org/r/396126 [21:01:23] (03PS1) 10Ottomata: Add LVS for superset.wikimedia.org -> thorium [puppet] - 10https://gerrit.wikimedia.org/r/396127 [21:03:56] (03PS2) 10Ottomata: Add misc cache route for superset.wikimedia.org -> thorium [puppet] - 10https://gerrit.wikimedia.org/r/396127 (https://phabricator.wikimedia.org/T166689) [21:06:39] (03CR) 10Dzahn: "imho it makes more sense if directors are named after the service and not after host names that happen to be backends right now" [puppet] - 10https://gerrit.wikimedia.org/r/396127 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [21:07:33] (03CR) 10Dzahn: "but i see that is already used for a bunch of others, so ignore that comment for this patch" [puppet] - 10https://gerrit.wikimedia.org/r/396127 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [21:09:27] (03PS2) 10Herron: Revert "puppet: temporarily allow puppetcompiler1001 to fetch all catalogs" [puppet] - 10https://gerrit.wikimedia.org/r/396126 [21:10:44] (03PS3) 10Herron: Revert "puppet: temporarily allow puppetcompiler1001 to fetch all catalogs" [puppet] - 10https://gerrit.wikimedia.org/r/396126 [21:12:40] (03CR) 10Herron: [C: 032] Revert "puppet: temporarily allow puppetcompiler1001 to fetch all catalogs" [puppet] - 10https://gerrit.wikimedia.org/r/396126 (owner: 10Herron) [21:14:21] (03PS2) 10Herron: puppet: remove (cleanup) hiera regex used for puppet 4 validation [puppet] - 
10https://gerrit.wikimedia.org/r/396079 (https://phabricator.wikimedia.org/T177254) [21:14:51] (03PS1) 10Dzahn: lvs::balancer: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/396129 (https://phabricator.wikimedia.org/T177225) [21:15:37] (03CR) 10Herron: [C: 032] puppet: remove (cleanup) hiera regex used for puppet 4 validation [puppet] - 10https://gerrit.wikimedia.org/r/396079 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [21:16:02] (03CR) 10BBlack: [C: 031] Add misc cache route for superset.wikimedia.org -> thorium [puppet] - 10https://gerrit.wikimedia.org/r/396127 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [21:16:06] (03PS2) 10Dzahn: lvs::balancer: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/396129 (https://phabricator.wikimedia.org/T177225) [21:16:55] (03PS1) 10Herron: Revert "puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx" [puppet] - 10https://gerrit.wikimedia.org/r/396130 [21:17:16] (03CR) 10jerkins-bot: [V: 04-1] Revert "puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx" [puppet] - 10https://gerrit.wikimedia.org/r/396130 (owner: 10Herron) [21:18:03] (03CR) 10Dzahn: [C: 032] lvs::balancer: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/396129 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [21:18:50] hmm, i wonder how to remove the 'special hosts' like vl1001-eth1.lvs1005.wikimedia.org [21:18:58] (from ganglia) [21:22:28] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3821378 (10herron) [21:23:11] (03Abandoned) 10Herron: Revert "puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx" [puppet] - 10https://gerrit.wikimedia.org/r/396130 (owner: 10Herron) [21:23:24] (03PS3) 10Ottomata: Add misc cache route for superset.wikimedia.org -> thorium [puppet] - 10https://gerrit.wikimedia.org/r/396127 
(https://phabricator.wikimedia.org/T166689)
[21:23:43] (CR) Ottomata: [V: 2 C: 2] Add misc cache route for superset.wikimedia.org -> thorium [puppet] - https://gerrit.wikimedia.org/r/396127 (https://phabricator.wikimedia.org/T166689) (owner: Ottomata)
[21:26:49] (PS1) Herron: puppetdb: remove (cleanup) ferm allow for puppetcompiler1001 [puppet] - https://gerrit.wikimedia.org/r/396132 (https://phabricator.wikimedia.org/T177254)
[21:27:27] (CR) Herron: [C: 2] puppetdb: remove (cleanup) ferm allow for puppetcompiler1001 [puppet] - https://gerrit.wikimedia.org/r/396132 (https://phabricator.wikimedia.org/T177254) (owner: Herron)
[21:27:35] (PS2) Herron: puppetdb: remove (cleanup) ferm allow for puppetcompiler1001 [puppet] - https://gerrit.wikimedia.org/r/396132 (https://phabricator.wikimedia.org/T177254)
[22:00:05] tgr: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Reading Infrastructure . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171207T2200).
[22:00:05] No GERRIT patches in the queue for this window AFAICS.
[22:01:17] * TheresNoTime pats jouncebot
[22:01:17] that was a good attempt
[22:09:06] oh wow, jouncebot got an upgrade
[22:09:34] not sure if it's an improvement...
[22:09:51] The humour?
[22:09:55] it's not new :P
[22:13:15] it never did that for SWAT, at least
[22:17:30] (PS4) Gergő Tisza: Deploy ReadingLists to testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/395687 (https://phabricator.wikimedia.org/T181107)
[22:18:03] (PS2) Gergő Tisza: Enable ReadingLists on all SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/395688 (https://phabricator.wikimedia.org/T181107)
[22:43:57] wait, no
[22:43:57] sql.php runs all the LoadExtensionSchemaUpdates hooks?
[22:43:57] that's just scary
[22:43:59] (PS1) Tjones: Updates to enable short URLs for transliteration for crhwiki [puppet] - https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582)
[22:44:51] and now I remember filing a bug about that: T157651
[22:44:51] T157651: Weird behavior of sql.php on beta - https://phabricator.wikimedia.org/T157651
[22:45:23] PROBLEM - Apache HTTP on mw2209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:46:10] RECOVERY - Apache HTTP on mw2209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.117 second response time
[22:55:00] Reedy: https://gerrit.wikimedia.org/r/#/c/396286/
[22:55:16] on a scale of 1 to 10, how horrible is that?
[22:55:59] "Hooks::register( 'Load" seems to find nothing in our hosted extensions at least
[22:56:56] lol
[23:04:39] well, it seems to work
[23:04:52] !log ran mwscript ../../../home/tgr/sql.php --wiki=mediawikiwiki --cluster extension1 --wikidb wikishared /srv/mediawiki-staging/php-1.31.0-wmf.11/extensions/ReadingLists/sql/readinglists.sql
[23:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0
[23:06:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[23:07:30] (CR) Gergő Tisza: [C: 2] Deploy ReadingLists to testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/395687 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[23:08:50] (Merged) jenkins-bot: Deploy ReadingLists to testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/395687 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[23:09:00] (CR) jenkins-bot: Deploy ReadingLists to testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/395687
(https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[23:11:28] (PS2) Tjones: Updates to enable short URLs for transliteration for crhwiki [puppet] - https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582)
[23:13:10] does it make sense to test an extension deployment on mwdebug1002?
[23:13:31] I think it will just mess up the i18n cache?
[23:13:57] you can test it without i18n
[23:14:04] Just scap it
[23:14:38] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Pre-deployment_testing_in_production seems to try to say something on the topic but I can't tell what
[23:14:55] scap as in scap pull?
[23:15:48] full scap is scap all the things
[23:15:51] ie don't bother pulling it to the machine
[23:16:45] but then I'm deploying it everywhere, not just the debug hosts
[23:16:50] not that there is much to test, I suppose
[23:17:22] Usually, you'd do it the other way then
[23:17:27] Scap without enabling it
[23:17:43] yeah, didn't think of that
[23:17:47] Then enable it, scap pull onto mwdebug
[23:17:47] test that way
[23:17:56] but it's only enabled on testwiki so meh
[23:22:43] !log tgr@tin Started scap: T181107 deploy ReadingLists to testwiki
[23:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:53] T181107: Deploy Reading Lists Service to production - https://phabricator.wikimedia.org/T181107
[23:25:48] (PS1) Dzahn: db2011: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/396290
[23:26:43] (CR) Dzahn: [C: 2] db2011: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/396290 (owner: Dzahn)
[23:26:51] (PS2) Dzahn: db2011: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/396290
[23:28:57] (PS3) Dzahn: db2011: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/396290 (https://phabricator.wikimedia.org/T177225)
[23:32:14] (CR) Dzahn: "clean puppet run without issues" [puppet] - https://gerrit.wikimedia.org/r/396290
(https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[23:35:56] (CR) Mholloway: [C: 1] Enable ReadingLists on all SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/395688 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[23:47:32] !log tgr@tin Finished scap: T181107 deploy ReadingLists to testwiki (duration: 24m 44s)
[23:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:42] T181107: Deploy Reading Lists Service to production - https://phabricator.wikimedia.org/T181107
[23:54:06] (CR) Gergő Tisza: [C: 2] Enable ReadingLists on all SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/395688 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[23:54:41] !log ran mwscript extensions/ReadingLists/maintenance/populateProjectsFromSiteMatrix.php --wiki=testwiki
[23:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:13] Operations, ops-eqiad, DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T182288#3819156 (Cmjohnson) disk is replaced and is rebuilding Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 0, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 0 WWN: 5000C50005643B18 Seque...
[23:55:31] (Merged) jenkins-bot: Enable ReadingLists on all SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/395688 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[23:56:43] (CR) jenkins-bot: Enable ReadingLists on all SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/395688 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[23:59:49] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: T181107 enable ReadingLists on all wikis (duration: 00m 46s)